Practitioner's Docket No. PERLIN-8



PATENT 






Preliminary Classification: 
Proposed Class: 
Subclass: 

NOTE: "All applicants are requested to include a preliminary classification on newly filed patent
applications. The preliminary classification, preferably class and subclass designations, should be
identified in the upper right-hand corner of the letter of transmittal accompanying the application
papers, for example 'Proposed Class 2, subclass 129.'" M.P.E.P. § 601, 7th ed.



IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 






Box Patent Application 

Assistant Commissioner for Patents 

Washington, D.C. 20231 

NEW APPLICATION TRANSMITTAL 

Transmitted herewith for filing is the patent application of 
Inventor(s): Mark W. Perlin



WARNING: 37 C.F.R. § 1.41(a)(1) points out:

"(a) A patent is applied for in the name or names of the actual inventor or inventors.

"(1) The inventorship of a nonprovisional application is that inventorship set forth in the oath or
declaration as prescribed by § 1.63, except as provided for in § 1.53(d)(4) and § 1.63(d). If an
oath or declaration as prescribed by § 1.63 is not filed during the pendency of a nonprovisional
application, the inventorship is that inventorship set forth in the application papers filed pursuant
to § 1.53(b), unless a petition under this paragraph accompanied by the fee set forth in § 1.17(i)
is filed supplying or changing the name or names of the inventor or inventors."



For (title): 



A METHOD AND SYSTEM FOR DNA ANALYSIS 



CERTIFICATION UNDER 37 C.F.R. § 1.10*
(Express Mail label number is mandatory.)
(Express Mail certification is optional.)



I hereby certify that this New Application Transmittal and the documents referred to as attached therein are being
deposited with the United States Postal Service on this date, February 15, 2000, in an envelope
as "Express Mail Post Office to Addressee," mailing Label Number EL396485536US, addressed
to the: Assistant Commissioner for Patents, Washington, D.C. 20231.

Tracey L. Milka 



(type or print name of person mailing paper)



Signature of person mailing paper

WARNING: Certificate of mailing (first class) or facsimile transmission procedures of 37 C.F.R. § 1.8 cannot be
used to obtain a date of mailing or transmission for this correspondence.

*WARNING: Each paper or fee filed by "Express Mail" must have the number of the "Express Mail" mailing label
placed thereon prior to mailing. 37 C.F.R. § 1.10(b).

"Since the filing of correspondence under § 1.10 without the Express Mail mailing label thereon
is an oversight that can be avoided by the exercise of reasonable care, requests for waiver of this
requirement will not be granted on petition." Notice of Oct. 24, 1996, 60 Fed. Reg. 56,439, at 56,442.

(New Application Transmittal [4-1]— page 1 of 11) 



1. Type of Application 

This new application is for a(n) 

(check one applicable item below) 

☒ Original (nonprovisional)

□ Design 
□ Plant 

WARNING: Do not use this transmittal for a completion in the U.S. of an International Application under 35
U.S.C. § 371(c)(4), unless the International Application is being filed as a divisional, continuation
or continuation-in-part application.

WARNING: Do not use this transmittal for the filing of a provisional application.

NOTE: If one of the following 3 items apply, then complete and attach ADDED PAGES FOR NEW APPLICATION
TRANSMITTAL WHERE BENEFIT OF A PRIOR U.S. APPLICATION CLAIMED and a NOTIFICATION
IN PARENT APPLICATION OF THE FILING OF THIS CONTINUATION APPLICATION.

□ Divisional.

□ Continuation.

□ Continuation-in-part (C-I-P).

2. Benefit of Prior U.S. Application(s) (35 U.S.C. §§ 119(e), 120, or 121)

NOTE: A nonprovisional application may claim an invention disclosed in one or more prior filed copending
nonprovisional applications or copending international applications designating the United States of
America. In order for a nonprovisional application to claim the benefit of a prior filed copending
nonprovisional application or copending international application designating the United States of
America, each prior application must name as an inventor at least one inventor named in the later filed
nonprovisional application and disclose the named inventor's invention claimed in at least one claim
of the later filed nonprovisional application in the manner provided by the first paragraph of 35 U.S.C.
§ 112. Each prior application must also be:

(i) An international application entitled to a filing date in accordance with PCT Article 11 and
designating the United States of America; or

(ii) Complete as set forth in § 1.51(b); or

(iii) Entitled to a filing date as set forth in § 1.53(b) or § 1.53(d) and include the basic filing fee set
forth in § 1.16; or

(iv) Entitled to a filing date as set forth in § 1.53(b) and have paid therein the processing and retention
fee set forth in § 1.21(l) within the time period set forth in § 1.53(f).

37 C.F.R. § 1.78(a)(1).

NOTE: If the new application being transmitted is a divisional, continuation or a continuation-in-part of a parent
case, or where the parent case is an International Application which designated the U.S., or benefit
of a prior provisional application is claimed, then check the following item and complete and attach
ADDED PAGES FOR NEW APPLICATION TRANSMITTAL WHERE BENEFIT OF PRIOR U.S.
APPLICATION(S) CLAIMED.

WARNING: If an application claims the benefit of the filing date of an earlier filed application under 35 U.S.C.
§§ 120, 121 or 365(c), the 20-year term of that application will be based upon the filing date of
the earliest U.S. application that the application makes reference to under 35 U.S.C. §§ 120, 121
or 365(c). (35 U.S.C. § 154(a)(2) does not take into account, for the determination of the patent
term, any application on which priority is claimed under 35 U.S.C. §§ 119, 365(a) or 365(b).) For
a C-I-P application, applicant should review whether any claim in the patent that will issue is
supported by an earlier application and, if not, the applicant should consider canceling the reference
to the earlier filed application. The term of a patent is not based on a claim-by-claim approach.
See Notice of April 14, 1995, 60 Fed. Reg. 20,195, at 20,205.




WARNING: When the last day of pendency of a provisional application falls on a Saturday, Sunday, or Federal
holiday within the District of Columbia, any nonprovisional application claiming benefit of the
provisional application must be filed prior to the Saturday, Sunday, or Federal holiday within the
District of Columbia. See 37 C.F.R. § 1.78(a)(3).

□ The new application being transmitted claims the benefit of prior U.S. application(s).
Enclosed are ADDED PAGES FOR NEW APPLICATION TRANSMITTAL
WHERE BENEFIT OF PRIOR U.S. APPLICATION(S) CLAIMED.

3. Papers Enclosed

A. Required for filing date under 37 C.F.R. § 1.53(b) (Regular) or 37 C.F.R § 1.153 
(Design) Application 

Pages of specification 

Pages of claims



Sheets of drawing



WARNING: DO NOT submit original drawings. A high quality copy of the drawings should be supplied when
filing a patent application. The drawings that are submitted to the Office must be on strong, white,
smooth, and non-shiny paper and meet the standards according to § 1.84. If corrections to the
drawings are necessary, they should be made to the original drawing and a high-quality copy of
the corrected original drawing then submitted to the Office. Only one copy is required or desired.
For comments on proposed then-new 37 C.F.R. § 1.84, see Notice of March 9, 1988 (1090 O.G.
57-62).

NOTE: "Identifying indicia, if provided, should include the application number or the title of the invention,
inventor's name, docket number (if any), and the name and telephone number of a person to call if
the Office is unable to match the drawings to the proper application. This information should be placed
on the back of each sheet of drawing a minimum distance of 1.5 cm (5/8 inch) down from the top
of the page . . . ." 37 C.F.R. § 1.84(c).

(complete the following, if applicable) 

□ The enclosed drawing(s) are photograph(s), and there is also attached a
"PETITION TO ACCEPT PHOTOGRAPH(S) AS DRAWING(S)." 37 C.F.R.
§ 1.84(b).

□ formal
☒ informal

B. Other Papers Enclosed

2 Pages of declaration and power of attorney



_ Pages of abstract 
J- Other 



4. Additional papers enclosed 

□ Amendment to claims 

□ Cancel in this application claims ___ before
calculating the filing fee. (At least one original independent claim must be
retained for filing purposes.)

□ Add the claims shown on the attached amendment (Claims added have 
been numbered consecutively following the highest numbered original 
claims.) 

□ Preliminary Amendment 

□ Information Disclosure Statement (37 C.F.R. § 1.98)

□ Form PTO-1449 (PTO/SB/08A and 08B)

□ Citations 






□ Declaration of Biological Deposit 

□ Submission of "Sequence Listing," computer readable copy and/or amendment 
pertaining thereto for biotechnology invention containing nucleotide and/or 
amino acid sequence. 

□ Authorization of Attorney(s) to Accept and Follow Instructions from Representative

□ Special Comments 

□ Other 

5. Declaration or oath (including power of attorney)

NOTE: A newly executed declaration is not required in a continuation or divisional application provided that
the prior nonprovisional application contained a declaration as required, the application being filed is
by all or fewer than all the inventors named in the prior application, there is no new matter in the
application being filed, and a copy of the executed declaration filed in the prior application (showing
the signature or an indication thereon that it was signed) is submitted. The copy must be accompanied
by a statement requesting deletion of the names of person(s) who are not inventors of the application
being filed. If the declaration in the prior application was filed under § 1.47, then a copy of that
declaration must be filed accompanied by a copy of the decision granting § 1.47 status or, if a nonsigning
person under § 1.47 has subsequently joined in a prior application, then a copy of the subsequently
executed declaration must be filed. See 37 C.F.R. §§ 1.63(d)(1)-(3).

NOTE: A declaration filed to complete an application must be executed, identify the specification to which it
is directed, identify each inventor by full name including family name and at least one given name, without
abbreviation together with any other given name or initial, and the residence, post office address and
country of citizenship of each inventor, and state whether the inventor is a sole or joint inventor. 37
C.F.R. § 1.63(a)(1)-(4).

NOTE: "The inventorship of a nonprovisional application is that inventorship set forth in the oath or declaration
as prescribed by § 1.63, except as provided for in § 1.53(d)(4) and § 1.63(d). If an oath or declaration
as prescribed by § 1.63 is not filed during the pendency of a nonprovisional application, the inventorship
is that inventorship set forth in the application papers filed pursuant to § 1.53(b), unless a petition under
this paragraph accompanied by the fee set forth in § 1.17(i) is filed supplying or changing the name
or names of the inventor or inventors." 37 C.F.R. § 1.41(a)(1).

☒ Enclosed

Executed by

(check all applicable boxes)

☒ inventor(s).

□ legal representative of inventor(s).
37 C.F.R. §§ 1.42 or 1.43.

□ joint inventor or person showing a proprietary
interest on behalf of inventor who refused to sign
or cannot be reached.

□ This is the petition required by 37 C.F.R. § 1.47 and the statement
required by 37 C.F.R. § 1.47 is also attached. See item 13 below
for fee.

□ Not Enclosed. 

NOTE: Where the filing is a completion in the U.S. of an International Application or where the completion of
the U.S. application contains subject matter in addition to the International Application, the application
may be treated as a continuation or continuation-in-part, as the case may be, utilizing ADDED PAGE
FOR NEW APPLICATION TRANSMITTAL WHERE BENEFIT OF PRIOR U.S. APPLICATION CLAIMED.

□ Application is made by a person authorized under 37 C.F.R. § 1.41(c) on
behalf of all the above named inventor(s).




(The declaration or oath, along with the surcharge required by 37 C.F.R. § 1.16(e),
can be filed subsequently.)



□ Showing that the filing is authorized.

(not required unless called into question; 37 C.F.R. § 1.41(d))

6. Inventorship Statement 

WARNING: If the named inventors are each not the inventors of all the claims an explanation, including the
ownership of the various claims at the time the last claimed invention was made, should be
submitted.

The inventorship for all the claims in this application are:
☒ The same.

or 

□ Not the same. An explanation, including the ownership of the various claims at 
the time the last claimed invention was made, 

□ is submitted.

□ will be submitted. 

7. Language 

NOTE: An application including a signed oath or declaration may be filed in a language other than English.
An English translation of the non-English language application and the processing fee of $130.00
required by 37 C.F.R. § 1.17(k) is required to be filed with the application, or within such time as may
be set by the Office. 37 C.F.R. § 1.52(d).

☒ English

□ Non-English

□ The attached translation includes a statement that the translation is accurate.
37 C.F.R. § 1.52(d).

8. Assignment

□ An assignment of the invention to ___



□ is attached. A separate □ "COVER SHEET FOR ASSIGNMENT (DOCUMENT)
ACCOMPANYING NEW PATENT APPLICATION" or □ FORM PTO
1595 is also attached.

□ will follow. 

NOTE: "If an assignment is submitted with a new application, send two separate letters: one for the application
and one for the assignment." Notice of May 4, 1990 (1114 O.G. 77-78).

WARNING: A newly executed "CERTIFICATE UNDER 37 C.F.R. § 3.73(b)" must be filed when a
continuation-in-part application is filed by an assignee. Notice of April 30, 1993, 1150 O.G. 62-64.






9. Certified Copy 

Certified copy(ies) of application(s)

Country          Appln. No.          Filed

Country          Appln. No.          Filed

Country          Appln. No.          Filed



from which priority is claimed 

□ is (are) attached. 

□ will follow. 

NOTE: The foreign application forming the basis for the claim for priority must be referred to in the oath or
declaration. 37 C.F.R. § 1.55(a) and 1.63.

NOTE: This item is for any foreign priority for which the application being filed directly relates. If any parent
U.S. application or International Application from which this application claims benefit under 35 U.S.C.
§ 120 is itself entitled to priority from a prior foreign application, then complete item 18 on the ADDED
PAGES FOR NEW APPLICATION TRANSMITTAL WHERE BENEFIT OF PRIOR U.S. APPLICATION(S)
CLAIMED.

10. Fee Calculation (37 C.F.R. § 1.16)
A. ☒ Regular application



CLAIMS AS FILED

                                          Number filed   Number Extra   Rate       Basic Fee
                                                                                   37 C.F.R. § 1.16(a)
                                                                                   $760.00

Total Claims (37 C.F.R. § 1.16(c))        22 - 20 =      2 x            $ 18.00    36.00

Independent Claims (37 C.F.R. § 1.16(b))  6 - 3 =        3 x            $ 78.00    234.00

Multiple dependent claim(s),
if any (37 C.F.R. § 1.16(d))                             +              $260.00


□ Amendment cancelling extra claims is enclosed.
□ Amendment deleting multiple-dependencies is enclosed.

□ Fee for extra claims is not being paid at this time.

NOTE: If the fees for extra claims are not paid on filing they must be paid or the claims cancelled by amendment,
prior to the expiration of the time period set for response by the Patent and Trademark Office in any
notice of fee deficiency. 37 C.F.R. § 1.16(d).

Filing Fee Calculation $ 960.00
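
The arithmetic in part A can be checked with a short sketch. The function name and its parameters are illustrative and are not part of the form. The per-claim rates ($18.00 and $78.00), the multiple-dependent fee ($260.00), the thresholds of 20 total and 3 independent claims, and the 50% small-entity reduction of item 11 come from the form itself. The basic fee is passed in explicitly, as an assumption: the scanned table reads $760.00, but only a basic fee of $690.00 reproduces the stated total of $960.00, so the example uses 690.00.

```python
from decimal import Decimal


def filing_fee(total_claims, independent_claims, basic_fee,
               has_multiple_dependent=False, small_entity=False):
    """Illustrative filing-fee check for a regular application (item 10, part A).

    basic_fee is supplied by the caller; the value to use is an assumption.
    """
    extra_claim_rate = Decimal("18.00")        # per claim in excess of 20
    extra_independent_rate = Decimal("78.00")  # per independent claim in excess of 3
    multiple_dependent_fee = Decimal("260.00")

    extra_total = max(total_claims - 20, 0)
    extra_independent = max(independent_claims - 3, 0)

    fee = (basic_fee
           + extra_total * extra_claim_rate
           + extra_independent * extra_independent_rate)
    if has_multiple_dependent:
        fee += multiple_dependent_fee
    if small_entity:
        fee = fee / 2  # item 11: 50% of the amount computed in A, B or C
    return fee


if __name__ == "__main__":
    assumed_basic_fee = Decimal("690.00")  # assumption, see the note above
    print(filing_fee(22, 6, assumed_basic_fee))                     # full fee
    print(filing_fee(22, 6, assumed_basic_fee, small_entity=True))  # small entity
```

With 22 total claims and 6 independent claims, the extra-claim charges are 2 x $18.00 = $36.00 and 3 x $78.00 = $234.00, matching the table.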

B. □ Design application 

($310.00-37 C.F.R. § 1.16(f)) 

Filing Fee Calculation $_ 




C. □ Plant application

($480.00-37 C.F.R. § 1.16(g))

Filing fee calculation $ 

11. Small Entity Statement(s) 

☒ Statement(s) that this is a filing by a small entity under 37 C.F.R. § 1.9 and 1.27
is (are) attached.

WARNING: "Status as a small entity must be specifically established in each application or patent in which
the status is available and desired. Status as a small entity in one application or patent does not
affect any other application or patent, including applications or patents which are directly or
indirectly dependent upon the application or patent in which the status has been established. The
refiling of an application under § 1.53 as a continuation, division, or continuation-in-part, including
a continued prosecution application under § 1.53(d), or the filing of a reissue application requires
a new determination as to continued entitlement to small entity status for the continuing or reissue
application. A nonprovisional application claiming benefit under 35 U.S.C. § 119(e), 120, 121, or
365(c) of a prior application, or a reissue application may rely on a statement filed in the prior
application or in the patent if the nonprovisional application or the reissue application includes a
reference to the statement in the prior application or in the patent or includes a copy of the
statement in the prior application or in the patent and status as a small entity is still proper and
desired. The payment of the small entity basic statutory filing fee will be treated as such a reference
for purposes of this section." 37 C.F.R. § 1.28(a)(2).

WARNING: "Small entity status must not be established when the person or persons signing the . . . statement
can unequivocally make the required self-certification." M.P.E.P. § 509.03, 6th ed., rev. 2, July
1996 (emphasis added).

(complete the following, if applicable) 

□ Status as a small entity was claimed in prior application

___ / ___, filed on ___, from which benefit

is being claimed for this application under

35 U.S.C. § □ 119(e),

□ 120,

□ 121,

□ 365(c),

and which status as a small entity is still proper and desired.
□ A copy of the statement in the prior application is included.
Filing Fee Calculation (50% of A, B or C above)
$ 480.00

NOTE: Any excess of the full fee paid will be refunded if small entity status is established and a refund request
is filed within 2 months of the date of timely payment of a full fee. The two-month period is not
extendable under § 1.136. 37 C.F.R. § 1.28(a).

12. Request for International-Type Search (37 C.F.R. § 1.104(d)) 

(complete, if applicable) 

□ Please prepare an international-type search report for this application at the time 
when national examination on the merits takes place. 






13. Fee Payment Being Made at This Time 

□ Not Enclosed 

□ No filing fee is to be paid at this time.

(This and the surcharge required by 37 C.F.R. § 1.16(e) can be paid
subsequently.)

☒ Enclosed

☒ Filing fee $ 480.00

□ Recording assignment 
($40.00; 37 C.F.R § 1.21(h)) 

(See attached "COVER SHEET FOR 
ASSIGNMENT ACCOMPANYING NEW 

APPLICATION".) $

□ Petition fee for filing by other than all the
inventors or person on behalf of the inventor
where inventor refused to sign or cannot be
reached

($130.00; 37 C.F.R. §§ 1.47 and 1.17(i)) $

□ For processing an application with a
specification in a non-English language

($130.00; 37 C.F.R. §§ 1.52(d) and 1.17(k)) $

□ Processing and retention fee

($130.00; 37 C.F.R. §§ 1.53(d) and 1.21(l)) $

□ Fee for international-type search report 

($40.00; 37 C.F.R. § 1.21(e)) $ 



NOTE: 37 C.F.R. § 1.21(l) establishes a fee for processing and retaining any application that is abandoned for
failing to complete the application pursuant to 37 C.F.R. § 1.53(f). This, as well as the changes to
37 C.F.R. §§ 1.53 and 1.78(a)(1), indicate that in order to obtain the benefit of a prior U.S. application,
either the basic filing fee must be paid, or the processing and retention fee of § 1.21(l) must be paid,
within 1 year from notification under § 1.53(f).

Total fees enclosed $ 480.00

14. Method of Payment of Fees

☒ Check in the amount of $ 480.00



□ Charge Account No. in the amount of 

$ 

A duplicate of this transmittal is attached. 

NOTE: Fees should be itemized in such a manner that it is clear for which purpose the fees are paid. 37 C.F.R.
§ 1.22(b).






15. Authorization to Charge Additional Fees 

WARNING: If no fees are to be paid on filing, the following items should not be completed.

WARNING: Accurately count claims, especially multiple dependent claims, to avoid unexpected high charges,
if extra claim charges are authorized.

☒ The Commissioner is hereby authorized to charge the following additional fees
by this paper and during the entire pendency of this application to Account No.
19-0737:

☒ 37 C.F.R. § 1.16(a), (f) or (g) (filing fees)

☒ 37 C.F.R. § 1.16(b), (c) and (d) (presentation of extra claims)

NOTE: Because additional fees for excess or multiple dependent claims not paid on filing or on later presentation
must only be paid or these claims cancelled by amendment prior to the expiration of the time period
set for response by the PTO in any notice of fee deficiency (37 C.F.R. § 1.16(d)), it might be best not
to authorize the PTO to charge additional claim fees, except possibly when dealing with amendments
after final action.

□ 37 C.F.R. § 1.16(e) (surcharge for filing the basic filing fee and/or declaration
on a date later than the filing date of the application)

□ 37 C.F.R. § 1.17(a)(1)-(5) (extension fees pursuant to § 1.136(a)).

□ 37 C.F.R. § 1.17 (application processing fees)

NOTE: ". . . A written request may be submitted in an application that is an authorization to treat any concurrent
or future reply, requiring a petition for an extension of time under this paragraph for its timely submission,
as incorporating a petition for extension of time for the appropriate length of time. An authorization to
charge all required fees, fees under § 1.17, or all required extension of time fees will be treated as a
constructive petition for an extension of time in any concurrent or future reply requiring a petition for
an extension of time under this paragraph for its timely submission. Submission of the fee set forth in
§ 1.17(a) will also be treated as a constructive petition for an extension of time in any concurrent reply
requiring a petition for an extension of time under this paragraph for its timely submission." 37 C.F.R.
§ 1.136(a)(3).

□ 37 C.F.R. § 1.18 (issue fee at or before mailing of Notice of Allowance, 
pursuant to 37 C.F.R. § 1.311(b)) 

NOTE: Where an authorization to charge the issue fee to a deposit account has been filed before the mailing
of a Notice of Allowance, the issue fee will be automatically charged to the deposit account at the time
of mailing the notice of allowance. 37 C.F.R. § 1.311(b).

NOTE: 37 C.F.R. § 1.28(b) requires "Notification of any change in status resulting in loss of entitlement to small
entity status must be filed in the application . . . prior to paying, or at the time of paying, . . . the issue
fee. . . ." From the wording of 37 C.F.R. § 1.28(b), (a) notification of change of status must be made
even if the fee is paid as "other than a small entity" and (b) no notification is required if the change
is to another small entity.




16. Instructions as to Overpayment 

NOTE: ". . . Amounts of twenty-five dollars or less will not be returned unless specifically requested within
a reasonable time, nor will the payer be notified of such amounts; amounts over twenty-five dollars may
be returned by check or, if requested, by credit to a deposit account." 37 C.F.R. § 1.26(a).

☒ Credit Account No. 19-0737

□ Refund 




Reg. No. 30,587 



Tel. No. (412) 621-9222 



Customer No. 



SIGNATURE OF PRACTITIONER

Ansel M. Schwartz



(type or print name of attorney)

One Sterling Plaza
201 N. Craig Street



P.O. Address 

Suite 304 
Pittsburgh, PA 



15213 






Incorporation by reference of added pages 

(check the following item if the application in this transmittal claims the benefit of
prior U.S. application(s) (including an international application entering the U.S.
stage as a continuation, divisional or C-I-P application) and complete and attach
the ADDED PAGES FOR NEW APPLICATION TRANSMITTAL WHERE BENEFIT OF
PRIOR U.S. APPLICATION(S) CLAIMED)

□ Plus Added Pages for New Application Transmittal Where Benefit of Prior U.S. 
Application(s) Claimed 

Number of pages added 

□ Plus Added Pages for Papers Referred to in Item 4 Above 

Number of pages added 

□ Plus added pages deleting names of inventor(s) named in prior application(s) 
who is/are no longer inventor(s) of the subject matter claimed in this application. 

Number of pages added 

□ Plus "Assignment Cover Letter Accompanying New Application" 

Number of pages added 

Statement Where No Further Pages Added

(if no further pages form a part of this Transmittal, then end this Transmittal with
this page and check the following item)

☒ This transmittal ends with this page.






Attorney's Docket No. PERLIN-8 PATENT

Applicant or Patentee: Mark W. Perlin 

Application or Patent No.; / 

Filed or Issued: 

For: A METHOD AND SYSTEM FOR DNA ANALYSIS



VERIFIED STATEMENT (DECLARATION) CLAIMING SMALL ENTITY
STATUS (37 CFR 1.9(f) and 1.27(b)) - INDEPENDENT INVENTOR

As a below named inventor, I hereby declare that I qualify as an independent inventor, as
defined in 37 CFR 1.9(c), for purposes of paying reduced fees under Sections 41(a) and
(b) of Title 35, United States Code, to the Patent and Trademark Office with regard to the
invention entitled A METHOD AND SYSTEM FOR DNA ANALYSIS 

described in 

☒ the specification filed herewith.

□ application no. / , filed 

□ patent no , issued 

I have not assigned, granted, conveyed or licensed, and am under no obligation under 
contract or law to assign, grant, convey or license, any rights in the invention to any person 
who could not be classified as an independent inventor under 37 CFR 1.9(c), if that person 
had made the invention, or to any concern that would not qualify as a small business 
concern under 37 CFR 1.9(d), or a nonprofit organization under 37 CFR 1.9(e). 

Each person, concern or organization to which I have assigned, granted, conveyed, or
licensed or am under an obligation under contract or law to assign, grant, convey, or license
any rights in the invention is listed below:

☒ no such person, concern, or organization.

□ persons, concerns or organizations listed below *

*NOTE: Separate verified statements are required from each named person, concern or organization having
rights to the invention averring to their status as small entities. (37 CFR 1.27)

FULL NAME 

ADDRESS 



□ INDIVIDUAL □ SMALL BUSINESS CONCERN □ NONPROFIT ORGANIZATION

FULL NAME 

ADDRESS 



□ INDIVIDUAL □ SMALL BUSINESS CONCERN □ NONPROFIT ORGANIZATION

FULL NAME

ADDRESS



□ INDIVIDUAL □ SMALL BUSINESS CONCERN □ NONPROFIT ORGANIZATION

Small Entity— Independent Inventor [7-1] — page 1 of 2) 



I acknowledge the duty to file, in this application or patent, notification of any change in
status resulting in loss of entitlement to small entity status prior to paying, or at the time
of paying, the earliest of the issue fee or any maintenance fee due after the date on which
status as a small entity is no longer appropriate. (37 CFR 1.28(b))

I hereby declare that all statements made herein of my own knowledge are true and that
all statements made on information and belief are believed to be true; and further, that these
statements were made with the knowledge that willful false statements and the like so made
are punishable by fine or imprisonment, or both, under Section 1001 of Title 18 of the United
States Code, and that such willful false statements may jeopardize the validity of the
application, any patent issuing thereon, or any patent to which this verified statement is
directed.

Mark W. Perlin 

Name of inventor 



Date 



Signature of Inventor 



Name of inventor 

Date 

Signature of Inventor 



Name of inventor 

Date 

Signature of Inventor 






A METHOD AND SYSTEM FOR DNA ANALYSIS 



FIELD OF THE INVENTION

The present invention pertains to a process for 
analyzing a DNA molecule. More specifically, the present 
invention is related to performing experiments that produce 
quantitative data, and then analyzing these data to characterize a 
DNA fragment. The invention also pertains to systems related to 
this DNA fragment information. 

BACKGROUND OF THE INVENTION

With the advent of high-throughput DNA fragment analysis
by electrophoretic separation, many useful genetic assays have 
been developed. These assays have application to genotyping, 
linkage analysis, genetic association, cancer progression, gene 
expression, pharmaceutical development, agricultural improvement, 
human identity, and forensic science. 

However, these assays inherently produce data that have 
significant error with respect to the size and concentration of the
characterized DNA fragments. Much calibration is currently done 
to help overcome these errors, including the use of in-lane 
molecular weight size standards. In spite of these improvements, 
the variability of these properties (between different 
instruments, runs, or lanes) can exceed the desired tolerance of 
the assays. 

Recently, advances have been made in the automated
scoring of genetic data. Many naturally occurring artifacts in
the amplification and separation of nucleic acids can be
eliminated through calibration and mathematical processing of the
data on a computing device (MW Perlin, MB Burks, RC Hoop, and EP
Hoffman, "Toward fully automated genotyping: allele assignment,
pedigree construction, phase determination, and recombination
detection in Duchenne muscular dystrophy," Am. J. Hum. Genet.,
vol. 55, no. 4, pp. 777-787, 1994; MW Perlin, G Lancia, and S-K
Ng, "Toward fully automated genotyping: genotyping microsatellite
markers by deconvolution," Am. J. Hum. Genet., vol. 57, no. 5, pp.
1199-1210, 1995; S-K Ng, "Automating computational molecular
genetics: solving the microsatellite genotyping problem," Carnegie
Mellon University, Doctoral dissertation CMU-CS-98-105, January
23, 1998), incorporated by reference.

This invention pertains to the novel use of calibrating 
data and mathematical analyses to computationally eliminate 
undesirable data artifacts in a nonobvious way. Specifically, the 
use of allelic ladders and coordinate transformations can help an 
automated data analysis system better reduce measurement 
variability to within a desired assay tolerance. This improved 
reproducibility is useful in that it results in greater accuracy 
and more complete automation of the genetic assays, often taking 
less time at a lower cost with fewer people. 



Genotyping TechnolooY 

Genotyping is the process of determining the alleles at 
an individual's genetic locus. Such loci can be any inherited DNA 
sequence in the genome, including protein-encoding genes and 
polymorphic markers. These markers include short tandem repeat 
(STR) sequences, single-nucleotide polymorphism (SNP) sequences, 
restriction fragment length polymorphism (RFLP) sequences, and 
other DNA sequences that express genetic variation (G Gyapay, J 
Morissette, A Vignal, C Dib, C Fizames, P Millasseau, S Marc, G 
Bernardi, M Lathrop, and J Weissenbach, "The 1993-94 Genethon 
Human Genetic Linkage Map," Nature Genetics, vol. 7, no. 2, pp.
246-339, 1994; PW Reed, JL Davies, JB Copeman, ST Bennett, SM
Palmer, LE Pritchard, SCL Gough, Y Kawaguchi, HJ Cordell, KM
Balfour, SC Jenkins, EE Powell, A Vignal, and JA Todd,
"Chromosome-specific microsatellite sets for fluorescence-based,
semi-automated genome mapping," Nature Genet., vol. 7, no. 3, pp.
390-395, 1994; L Kruglyak, "The use of a genetic map of biallelic
markers in linkage studies," Nature Genet., vol. 17, no. 1, pp.
21-24, 1997; D Wang, J Fan, C Siao, A Berno, P Young, R Sapolsky,
G Ghandour, N Perkins, E Winchester, J Spencer, L Kruglyak, L
Stein, L Hsie, T Topaloglou, E Hubbell, E Robinson, M Mittmann, M
Morris, N Shen, D Kilburn, J Rioux, C Nusbaum, S Rozen, T Hudson,
and E Lander, "Large-scale identification, mapping, and genotyping
of single-nucleotide polymorphisms in the human genome," Science,
vol. 280, no. 5366, pp. 1077-82, 1998; P Vos, R Hogers, M Bleeker,
M Reijans, T van de Lee, M Hornes, A Frijters, J Pot, J Peleman, M
Kuiper, and M Zabeau, "AFLP: a new technique for DNA
fingerprinting," Nucleic Acids Res, vol. 23, no. 21, pp. 4407-14,
1995; J Sambrook, EF Fritsch, and T Maniatis, Molecular Cloning,
Second Edition. Plainview, NY: Cold Spring Harbor Press, 1989),
incorporated by reference.

The polymorphism assay is typically done by
characterizing the length and quantity of DNA from an individual
at a marker. For example, STRs are assayed by polymerase chain
reaction (PCR) amplification of an individual's STR locus using a
labeled PCR primer, followed by size separation of the amplified
PCR fragments. Detection of the fragment labels, together with
in-lane size standards, generates a signal that permits
characterization of the size and quantity of the DNA fragments.
From this characterization, the alleles of the STR locus in the
individual's genome can be determined (J Weber and P May,
"Abundant class of human DNA polymorphisms which can be typed
using the polymerase chain reaction," Am. J. Hum. Genet., vol. 44,
pp. 388-396, 1989; JS Ziegle, Y Su, KP Corcoran, L Nie, PE
Mayrand, LB Hoff, LJ McBride, MN Kronick, and SR Diehl,
"Application of automated DNA sizing technology for genotyping
microsatellite loci," Genomics, vol. 14, pp. 1026-1031, 1992),
incorporated by reference.

The labels can use radioactivity, fluorescence, 
infrared, or other nonradioactive labeling methods (FM Ausubel, R 
Brent, RE Kingston, DD Moore, JG Seidman, JA Smith, and K Struhl, 
ed., Current Protocols in Molecular Biology. New York, NY: John
Wiley and Sons, 1995; NJ Dracopoli, JL Haines, BR Korf, CC Morton,
CE Seidman, JG Seidman, DT Moir, and D Smith, ed., Current
Protocols in Human Genetics. New York: John Wiley and Sons, 1995;
LJ Kricka, ed., Nonisotopic Probing, Blotting, and Sequencing,
Second Edition. San Diego, CA: Academic Press, 1995), incorporated 
by reference. 

Size separation of fragment molecules is typically done 
using gel or capillary electrophoresis (CE); newer methods include
mass spectrometry and microchannel arrays (RA Mathies and XC
Huang, "Capillary array electrophoresis: an approach to
high-speed, high-throughput DNA sequencing," Nature, vol. 359, pp.
167-169, 1992; KJ Wu, A Stedding, and CH Becker, "Matrix-assisted
laser desorption time-of-flight mass spectrometry of
oligonucleotides using 3-hydroxypicolinic acid as an
ultraviolet-sensitive matrix," Rapid Commun. Mass Spectrom., vol.
7, pp. 142-146, 1993), incorporated by reference.

The label detection method is contingent on both the 
labels used and the size separation mechanism. For example, with 
automated DNA sequencers such as the PE Biosystems ABI/377 gel, 
ABI/310 single capillary or ABI/3700 capillary array instruments, 
the detection is done by laser scanning of the fluorescently
labeled fragments, imaging on a CCD camera, and electronic
acquisition of the signals from the CCD camera. Flatbed laser
scanners, such as the Molecular Dynamics Fluorimager or the
Hitachi FMBIO/II, acquire fluorescent signals similarly. Li-Cor's
infrared automated sequencer uses a detection technology modified
for the infrared range. Radioactivity can be detected using film
or phosphor screens. In mass spectrometry, the atomic mass can be
used as a sensitive label. See (A. J. Kostichka, Bio/Technology,
vol. 10, pp. 78, 1992), incorporated by reference.

Size characterization is done by comparing the sample 
fragment's signal in the context of the size standards. By 
separate calibration of the size standards used, the relative 
molecular size can be inferred. This size is usually only an 
approximation to the true size in base pair units, since the size 
standards and the sample fragments generally have different 
chemistries and electrophoretic migration patterns (S-K Ng, 
"Automating computational molecular genetics: solving the 
microsatellite genotyping problem," Carnegie Mellon University,
Doctoral dissertation CMU-CS-98-105, January 23, 1998), 
incorporated by reference. 
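
As an illustration of size characterization against in-lane standards, the following minimal sketch interpolates linearly between the two size-standard peaks that flank a sample peak. The function name scan_to_size, the choice of linear interpolation, and the example scan positions are assumptions made for illustration; the text above does not prescribe a particular interpolation method.

```python
from bisect import bisect_right

def scan_to_size(scan, std_scans, std_sizes):
    """Map a scan position to a relative size (bp) by linear interpolation
    between the two flanking size-standard peaks (hypothetical helper)."""
    if not (std_scans[0] <= scan <= std_scans[-1]):
        raise ValueError("scan position lies outside the size-standard range")
    # Index of the standard peak at or just to the left of the sample peak.
    i = min(bisect_right(std_scans, scan) - 1, len(std_scans) - 2)
    x0, x1 = std_scans[i], std_scans[i + 1]
    y0, y1 = std_sizes[i], std_sizes[i + 1]
    return y0 + (y1 - y0) * (scan - x0) / (x1 - x0)

# Hypothetical in-lane standards: scan positions and their calibrated sizes.
std_scans = [1000, 1500, 2000, 2500]
std_sizes = [100, 150, 200, 250]
sample_size = scan_to_size(1750, std_scans, std_sizes)
```

The value returned is a size relative to the standards, which, as noted above, only approximates the true size in base pair units.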

Quantitation of the DNA signal is usually done by 
examining peak heights or peak areas. One inexact peak area 
method simply records the area under the curve; this approach does 
not account for band overlap between different peaks. It is often 
useful to determine the quality (e.g., error, accuracy, 
concordance with expectations) of the size or quantity 
characterizations. See (DR Richards and MW Perlin, "Quantitative 
analysis of gel electrophoresis data for automated genotyping 
applications," Amer. J. Hum. Genet., vol. 57, no. 4 Supplement, 
pp. A26, 1995), incorporated by reference. 
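
The two quantitation measures named above can be sketched as follows. This is a minimal illustration of the inexact area-under-the-curve method, assuming unit spacing between samples and a peak window [lo, hi] chosen by the caller; like the method it illustrates, it does not correct for band overlap. The function names are hypothetical.

```python
def peak_height(signal, lo, hi):
    """Tallest sample within the peak window signal[lo..hi]."""
    return max(signal[lo:hi + 1])

def peak_area(signal, lo, hi):
    """Trapezoidal area under signal[lo..hi], assuming unit sample spacing.
    Overlap with neighboring peaks is NOT accounted for."""
    window = signal[lo:hi + 1]
    return sum((a + b) / 2.0 for a, b in zip(window, window[1:]))

# Hypothetical baseline-corrected trace containing a single peak.
signal = [0, 1, 4, 9, 4, 1, 0]
height = peak_height(signal, 1, 5)
area = peak_area(signal, 1, 5)
```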

The actual genotyping result depends on the type of 
genotype, the technology used, and the scoring method. For 
example, with STR data, following size separation and
characterization, the sizes (exact, rounded, or binned) of the two
tallest peaks might be used as the alleles. Alternatively, PCR
artifacts (e.g., stutter, relative amplification) can be accounted
for in the analysis, and the alleles determined after mathematical
corrections have been applied. See (MW Perlin, "Method and system
for genotyping," U.S. Patent #5,541,067, Jul. 30, 1996; MW Perlin,
"Method and system for genotyping," U.S. Patent #5,580,728, Dec.
3, 1996), incorporated by reference.
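
The simplest scoring rule mentioned above, taking the binned sizes of the two tallest peaks as the alleles, can be sketched as below. The function name call_alleles, the bin_width parameter, and the example peak list are assumptions for illustration. The sketch deliberately omits the stutter and relative amplification corrections, and it does not handle homozygotes, for which only one true allele peak is present.

```python
def call_alleles(peaks, bin_width=1.0):
    """peaks: list of (size_bp, height) pairs.
    Returns the binned sizes of the two tallest peaks, in ascending order."""
    tallest = sorted(peaks, key=lambda p: p[1], reverse=True)[:2]
    return sorted(round(size / bin_width) * bin_width for size, _ in tallest)

# Hypothetical dinucleotide STR: two allele peaks plus smaller stutter peaks.
peaks = [(148.1, 120.0), (150.2, 900.0), (152.0, 300.0),
         (156.2, 200.0), (158.1, 850.0)]
alleles = call_alleles(peaks, bin_width=2.0)
```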

Genotyping Applications

Genotyping data can be used to determine how mapped
markers are shared between related individuals. By correlating
this sharing information with phenotypic traits, it is possible to
localize a gene associated with that inherited trait. This
approach is widely used in genetic linkage and association studies
(J Ott, Analysis of Human Genetic Linkage, Revised Edition.
Baltimore, Maryland: The Johns Hopkins University Press, 1991; N
Risch, "Genetic Linkage and Complex Diseases, With Special
Reference to Psychiatric Disorders," Genet. Epidemiol., vol. 7,
pp. 3-16, 1990; N Risch and K Merikangas, "The future of genetic
studies of complex human diseases," Science, vol. 273, pp.
1516-1517, 1996), incorporated by reference.

Genotyping data can also be used to identify
individuals. For example, in forensic science, DNA evidence can
connect a suspect to the scene of a crime. DNA databases can
provide a repository of such relational information (CP Kimpton, P
Gill, A Walton, A Urquhart, ES Millican, and M Adams, "Automated
DNA profiling employing multiplex amplification of short tandem
repeat loci," PCR Meth. Appl., vol. 3, pp. 13-22, 1993; JE McEwen,
"Forensic DNA data banking by state crime laboratories," Am. J.
Hum. Genet., vol. 56, pp. 1487-1492, 1995; K Inman and N Rudin, An
Introduction to Forensic DNA Analysis. Boca Raton, FL: CRC Press,
1997; CJ Fregeau and Fourney, "DNA typing with fluorescently
tagged short tandem repeats: a sensitive and accurate approach to
human identification," BioTechniques, vol. 15, no. 1, pp. 100-119,
1993), incorporated by reference.

Linked genetic markers can help predict the risk of 
disease. In monitoring cancer, STRs are used to assess 
microsatellite instability (MI) and loss of heterozygosity (LOH),
chromosomal alterations that reflect tumor progression. (ID
Young, Introduction to Risk Calculation in Genetic Counselling.
Oxford: Oxford University Press, 1991; L Cawkwell, L Ding, FA
Lewis, I Martin, MF Dixon, and P Quirke, "Microsatellite
instability in colorectal cancer: improved assessment using
fluorescent polymerase chain reaction," Gastroenterology, vol.
109, pp. 465-471, 1995; F Canzian, A Salovaara, P Kristo, RB
Chadwick, LA Aaltonen, and A de la Chapelle, "Semiautomated
assessment of loss of heterozygosity and replication error in
tumors," Cancer Research, vol. 56, pp. 3331-3337, 1996; S
Thibodeau, G Bren, and D Schaid, "Microsatellite instability in
cancer of the proximal colon," Science, vol. 260, no. 5109, pp.
816-819, 1993), incorporated by reference.



For crop and animal improvement, genetic mapping is a
very powerful tool. Genotyping can help identify useful traits of
nutritional or economic importance. (HJ Vilkki, DJ de Koning, K
Elo, R Velmala, and A Maki-Tanila, "Multiple marker mapping of
quantitative trait loci of Finnish dairy cattle by regression," J.
Dairy Sci., vol. 80, no. 1, pp. 198-204, 1997; SM Kappes, JW
Keele, RT Stone, RA McGraw, TS Sonstegard, TP Smith, NL
Lopez-Corrales, and CW Beattie, "A second-generation linkage map
of the bovine genome," Genome Res., vol. 7, no. 3, pp. 235-249,
1997; M Georges, D Nielson, M Mackinnon, A Mishra, R Okimoto, AT
Pasquino, LS Sargeant, A Sorensen, MR Steele, and X Zhao, "Mapping
quantitative trait loci controlling milk production in dairy
cattle by exploiting progeny testing," Genetics, vol. 139, no. 2,
pp. 907-920, 1995; GA Rohrer, LJ Alexander, Z Hu, TP Smith, JW
Keele, and CW Beattie, "A comprehensive map of the porcine
genome," Genome Res., vol. 6, no. 5, pp. 371-391, 1996; J Hillel,
"Map-based quantitative trait locus identification," Poult. Sci.,
vol. 76, no. 8, pp. 1115-1120, 1997; HH Cheng, "Mapping the
chicken genome," Poult. Sci., vol. 76, no. 8, pp. 1101-1107,
1997), incorporated by reference.

Other Sizing Assays

Fragment analysis finds application in other genetic
methods. Often fragment sizes are used to multiplex many
experiments into one shared readout pathway, where size (or size
range) serves as an index into post-readout demultiplexing. For
example, multiple genotypes are typically pooled into a single
lane for more efficient readout. Quantifying information can help
determine the relative amounts of nucleic acid products present in
tissues. (GR Taylor, JS Noble, and RF Mueller, "Automated
analysis of multiplex microsatellites," J. Med. Genet., vol. 31,
pp. 937-943, 1994; LS Schwartz, J Tarleton, B Popovich, WK
Seltzer, and EP Hoffman, "Fluorescent multiplex linkage analysis
and carrier detection for Duchenne/Becker muscular dystrophy," Am.
J. Hum. Genet., vol. 51, pp. 721-729, 1992; CP Kimpton, P Gill, A
Walton, A Urquhart, ES Millican, and M Adams, "Automated DNA
profiling employing multiplex amplification of short tandem repeat
loci," PCR Meth. Appl., vol. 3, pp. 13-22, 1993), incorporated by
reference.
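
The use of size range as an index into post-readout demultiplexing can be sketched as follows. This is a minimal illustration, assuming each pooled marker is identified by a dye color together with a non-overlapping size range; the marker names, colors, and ranges shown are hypothetical.

```python
def demultiplex(fragments, marker_ranges):
    """fragments: list of (color, size_bp, height) read from one pooled lane.
    marker_ranges: {marker: (color, lo_bp, hi_bp)}.
    Returns {marker: [(size_bp, height), ...]}."""
    by_marker = {name: [] for name in marker_ranges}
    for color, size, height in fragments:
        for name, (m_color, lo, hi) in marker_ranges.items():
            if color == m_color and lo <= size <= hi:
                by_marker[name].append((size, height))
                break  # ranges are assumed not to overlap within a color
    return by_marker

# Hypothetical panel: three markers pooled into a single lane.
marker_ranges = {
    "markerA": ("blue", 100.0, 130.0),
    "markerB": ("blue", 180.0, 220.0),
    "markerC": ("green", 100.0, 130.0),
}
fragments = [("blue", 112.3, 800.0), ("blue", 116.2, 650.0),
             ("blue", 201.7, 900.0), ("green", 120.1, 400.0)]
by_marker = demultiplex(fragments, marker_ranges)
```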

Differential display is a gene expression assay. It
performs a reverse transcriptase PCR (RT-PCR) to capture the state
of expressed mRNA molecules in a more robust DNA form. These
DNAs are then size separated, and the size bins provide an index
into particular molecules. Variation at a size bin between two
tissue assays is interpreted as a concomitant variation in the
underlying mRNA gene expression profile. A peak quantification at
a bin estimates the underlying mRNA concentration. Comparison of
the quantitation of two different samples at the same bin provides
a measure of relative up- or down-regulation of gene expression.
(SW Jones, D Cai, OS Weislow, and B Esmaeli-Azad, "Generation of
multiple mRNA fingerprints using fluorescence-based differential
display and an automated DNA sequencer," BioTechniques, vol. 22,
no. 3, pp. 536-543, 1997; P Liang and A Pardee, "Differential
display of eukaryotic messenger RNA by means of the polymerase
chain reaction," Science, vol. 257, pp. 967-971, 1992; KR
Luehrsen, LL Marr, E van der Knaap, and S Cumberledge, "Analysis
of differential display RT-PCR products using fluorescent primers
and Genescan software," BioTechniques, vol. 22, no. 1, pp.
168-174, 1997), incorporated by reference.
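
The bin-by-bin comparison of two samples can be sketched as a log ratio of peak quantities. This is one possible measure, chosen here for illustration only; the text does not specify a formula. The function name log2_ratios and the floor parameter (a hypothetical noise floor that avoids division by zero when a bin is empty in one tissue) are assumptions.

```python
import math

def log2_ratios(quant_a, quant_b, floor=1.0):
    """quant_a, quant_b: {size_bin: peak quantity} for two tissue samples.
    Returns {size_bin: log2(a / b)}; positive values suggest up-regulation
    in sample A relative to sample B, negative values down-regulation."""
    bins = sorted(set(quant_a) | set(quant_b))
    return {
        b: math.log2(max(quant_a.get(b, 0.0), floor) /
                     max(quant_b.get(b, 0.0), floor))
        for b in bins
    }

# Hypothetical peak quantities at two size bins.
tissue_a = {210: 800.0, 254: 100.0}
tissue_b = {210: 200.0, 254: 400.0}
ratios = log2_ratios(tissue_a, tissue_b)
```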

Single-strand conformation polymorphism (SSCP) is a
method for detecting different mutations in a gene. Single base
pair changes can markedly affect fragment mobility of the
conformer, and these mobility changes can be detected in a size
separation assay. SSCP is of particular use in identifying and
diagnosing genetic mutations (M Orita, H Iwahana, H Kanazawa, K
Hayashi, and T Sekiya, "Detection of polymorphisms of human DNA by
gel electrophoresis as single-strand conformation polymorphisms,"
Proc Natl Acad Sci USA, vol. 86, pp. 2766-2770, 1989),
incorporated by reference.

The AFLP technique is a very powerful DNA
fingerprinting technique for DNAs of any origin or complexity.
AFLP is based on the selective PCR amplification of restriction
fragments from a total digest of genomic DNA. The technique
involves three steps: (i) restriction of the DNA and ligation of
oligonucleotide adapters, (ii) selective amplification of sets of
restriction fragments, and (iii) gel analysis of the amplified
fragments. PCR amplification of restriction fragments is achieved
by using the adapter and restriction site sequence as target sites
for primer annealing. The selective amplification is achieved by
the use of primers that extend into the restriction fragments,
amplifying only those fragments in which the primer extensions
match the nucleotides flanking the restriction sites. Using this
method, sets of restriction fragments may be visualized by PCR
without knowledge of nucleotide sequence. The method allows the
specific co-amplification of high numbers of restriction
fragments. The number of fragments that can be analyzed
simultaneously, however, is dependent on the resolution of the
detection system. Typically 50-100 restriction fragments are
amplified and detected on denaturing polyacrylamide gels. (P Vos,
R Hogers, M Bleeker, M Reijans, T van de Lee, M Hornes, A
Frijters, J Pot, J Peleman, M Kuiper, and M Zabeau, "AFLP: a new
technique for DNA fingerprinting," Nucleic Acids Res, vol. 23, no.
21, pp. 4407-14, 1995), incorporated by reference.

Data Scoring 

The final step in any fragment assay is scoring the 
data. This is typically done by having people visually review 
every experiment. Some systems (e.g., PE Informatics' Genotyper 
program) perform an initial computer review of the data, to make 
the manual visual review of every genotype easier. More advanced 
systems (e.g., Cybergenetics' TrueAllele technology) fully
automate the data review, and provide data quality scores that can
be used to identify data artifacts (for eliminating such data from
consideration) and rank the data scores (to focus on just the
2%-25% of suspect data calls). See (B Palsson, F Palsson, M Perlin,
H Gudbjartsson, K Stefansson, and J Gulcher, "Using quality
measures to facilitate allele calling in high-throughput
genotyping," Genome Research, vol. 9, no. 10, pp. 1002-1012, 1999;
MW Perlin, "Method and system for genotyping," U.S. Patent
#5,876,933, Mar. 2, 1999), incorporated by reference.

However, even with such advanced scoring technology, 
artifacts can obscure the results. More importantly, insufficient 
data calibration can preclude the achievement of very low (e.g., 
<1%) data error rates, regardless of the scoring methods. For 
example, in high-throughput STR genotyping, differential migration
of a sample's PCR fragments relative to the size standards can
produce subtle shifts in detected size. This problem is worse 
when different instruments are used, or when size separation 
protocols are not entirely uniform. The result is that fragments 
can be incorrectly assigned to allele bins in a way that cannot be 
corrected without recourse to additional information (e.g., 
pedigree data) completely outside the STR sizing assay. 

Whole System 

This invention centers on a new way to greatly reduce
sizing and quantitation errors in fragment analysis. By designing
data generation experiments that include the proper calibration
data (e.g., internal lane standards, allelic ladders, uniform run
conditions), most of these fragment analysis errors can be
eliminated entirely. Moreover, computer software can be devised
that fully exploits these data calibrations to automatically
identify artifacts and rank the data by quality. The result is a
largely error-free system that requires minimal (if any) human
intervention.






SUMMARY OF THE INVENTION 

The present invention pertains to a method for analyzing 
a nucleic acid sample. The method comprises the steps of forming 
labeled DNA sample fragments from a nucleic acid sample. Then 
there is the step of size separating and detecting said sample 
fragments to form a sample signal. Then there is the step of 
forming labeled DNA ladder fragments corresponding to molecular 
lengths. Then there is the step of size separating and detecting 
said ladder fragments to form a ladder signal. Then there is the 
step of transforming the sample signal into length coordinates 
using the ladder signal. Then there is the step of analyzing the 
nucleic acid sample signal in length coordinates. 

The present invention also pertains to a system for 
analyzing a nucleic acid sample. The system comprises means for 
forming labeled DNA sample fragments from a nucleic acid sample. 
The system further comprises means for size separating and 
detecting said sample fragments to form a sample signal, said 
separating and detecting means in communication with the sample 
fragments. The system further comprises means for forming labeled 
DNA ladder fragments corresponding to molecular lengths. The 
system further comprises means for size separating and detecting 
said ladder fragments to form a ladder signal, said separating and 
detecting means in communication with the ladder fragments. The 
system further comprises means for transforming the sample signal 
into length coordinates using the ladder signal, said transforming 
means in communication with the signals. The system further 
comprises means for analyzing the nucleic acid sample signal in 
length coordinates, said analyzing means in communication with the 
transforming means. 

The present invention also pertains to a method for 
generating revenue from computer scoring of genetic data. The
method comprises the steps of supplying a software program that
automatically scores genetic data. Then there is the step of
forming genetic data that can be scored by the software program.
Then there is the step of scoring the genetic data using the
software program to form a quantity of genetic data. Then there
is the step of generating a revenue from computer scoring of
genetic data that is related to the quantity.

The present invention also pertains to a method for
producing a nucleic acid analysis. The method comprises the steps
of analyzing a first nucleic acid sample on a first size
separation instrument to form a first signal. Then there is the
step of analyzing a second nucleic acid sample on a second size
separation instrument to form a second signal. Then there is the
step of comparing the first signal with the second signal in a
computing device with memory to form a comparison. Then there is
the step of producing a nucleic acid analysis of the two samples
from the comparison that is independent of the size separation
instruments used.

The present invention also pertains to a method for
resolving DNA mixtures. The method comprises the steps of
obtaining DNA profile data that include a mixed sample. Then
there is the step of representing the data in a linear equation.
Then there is the step of deriving a solution from the linear
equation. Then there is the step of resolving the DNA mixture
from the solution.
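
The summary does not fix the form of the linear equation. Purely as an assumed illustration, suppose the normalized peak quantities d at one locus are modeled as a weighted sum of two contributor genotype vectors a and b, that is, d = w*a + (1-w)*b, with both genotypes known. The mixture weight w then follows by least squares, as sketched below; the function name and the example vectors are hypothetical.

```python
def mixture_weight(d, a, b):
    """Least-squares solution w of the assumed model d = w*a + (1 - w)*b,
    where d, a, b are per-allele quantity vectors of equal length."""
    diff = [ai - bi for ai, bi in zip(a, b)]    # a - b
    resid = [di - bi for di, bi in zip(d, b)]   # d - b = w * (a - b)
    denom = sum(x * x for x in diff)
    if denom == 0:
        raise ValueError("genotypes a and b are identical; w is not identifiable")
    return sum(x * y for x, y in zip(diff, resid)) / denom

# Hypothetical locus with four alleles; each genotype vector sums to 2.
a = [1.0, 1.0, 0.0, 0.0]   # contributor A carries alleles 1 and 2
b = [0.0, 1.0, 1.0, 0.0]   # contributor B carries alleles 2 and 3
d = [0.7, 1.0, 0.3, 0.0]   # observed mixed profile, normalized to sum to 2
w = mixture_weight(d, a, b)
```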




BRIEF DESCRIPTION OF THE DRAWINGS 
Figure 1 shows the steps of creating sized profiles. 

Figure 2 shows unimodal plots, each corresponding to a fluorescent
dye; within each plot, intensity is plotted against the sampled
spectrum.

Figure 3 shows the unimodality constraint determining the function
space geometry of the spectrum sampling vectors.

Figure 4 shows the results of signal processing, size tracking,
and size transformation.

Figure 5 shows the steps of quantitating and analyzing genetic
data.

Figure 6 shows the results of ladder processing, peak quantitation
and allele calling.

Figure 7 shows a graphical user interface for navigating
prioritized genotyping results.

Figure 8 shows a textual interface for displaying useful genotype
results.

Figure 9 shows a visualization that is customized to a data
artifact.

Figure 10 shows a system for analyzing a nucleic acid sample.

Figure 11 shows the result of a differential display gene
expression analysis.




Figure 12 shows the flow graph of automated software assembly. 

Figure 13 shows a spreadsheet for calculating the labor cost of 
scoring genetic data. 

Figure 14 shows a heuristic function dev-(g(w)) which has an
unambiguous local minimum.



DESCRIPTION OF THE PREFERRED EMBODIMENT 

Data Generation 

In the most preferred embodiment, genotyping data are
generated using STR markers. These tandem repeats include mono-,
di-, tri-, tetra-, penta-, hexa-, hepta-, octa-, nona-, deca- (and
so on) nucleotide repeat elements. STRs are highly abundant and
informative markers distributed throughout the genomes of many
species (including human). Typically, STRs are labeled, PCR
amplified, and then detected (for size and quantity) on an
electrophoretic gel.

The laboratory processing starts with the acquisition of 
a sample, and the extraction of its DNA. The extraction and 
purification are typically followed by PCR amplification.
Labeling is generally done using a 5' labeled PCR primer, or with
incorporation labeling in the PCR. Prior to loading, multiple
marker PCR products in k-1 different fluorescent colors are
pooled, and size standards (preferably in a kth different color)
are added. Size separation and detection are preferably done using
automated fluorescent DNA sequencers, with either slab gel or
capillary technology. The detected signals represent the
progression of DNA bands as a function of time. These signals are
transmitted from the sequencer to a computing device with memory,
where they are stored in a file. (NJ Dracopoli, JL Haines, BR
Korf, CC Morton, CE Seidman, JG Seidman, DT Moir, and D Smith,
ed., Current Protocols in Human Genetics. New York: John Wiley and
Sons, 1995), incorporated by reference.

To create STR allelic ladders, the most preferred
embodiment entails PCR amplification of pooled samples. This can
be done by preparing DNA from N (preferably, N is between 2 and
200, depending on the application) individuals in equimolar 3
ng/ul concentrations. These DNAs are then pooled. After
dilution, each PCR template contains 48 ng of DNA in an
18 ul volume, and is included in a standard 50 ul PCR containing
2.5 units of AmpliTaq Gold, 1.25 ul of each primer, 200 uM dNTPs,
and 2.5 mM MgCl2. This mixture is then PCR amplified with its STR
primers (one labeled) on a thermocycler (e.g., with an MJ Research
PTC-100, use 30 cycles of 94°C for 1.25', 55°C for 1', and 72°C
for 1'). Size separation of the PCR products on an ABI sequencer
includes internal lane size standards (GS500 ROX-labeled 50 bp
sizing ladder, Perkin-Elmer, Foster City, CA; 20 bp MapMarkers
sizing ladder, BioVentures, Murfreesboro, TN). Files are
similarly recorded from this experiment.

In an alternative preferred embodiment, multiplexed SNP
data are generated using size standards with standard protocols.
Typically, each size bin in the electrophoretic signal corresponds
to one marker or polymorphism. Presence in the size bin of a
signal of sufficient strength and correct color indicates the
presence of an allele; absence of the signal indicates allele
absence. Signal size and color establish the allele, while signal
strength determines the amount of DNA present (if any). (NJ
Dracopoli, JL Haines, BR Korf, CC Morton, CE Seidman, JG Seidman,
DT Moir, and D Smith, ed., Current Protocols in Human Genetics.
New York: John Wiley and Sons, 1995), incorporated by reference.
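
The presence/absence rule for multiplexed SNP data can be sketched as follows. This is a minimal illustration, assuming each allele is assigned one dye color and one size bin and that a single height threshold (min_height, a hypothetical parameter) defines "sufficient strength"; the allele names and bin boundaries are invented for the example.

```python
def call_snp_alleles(peaks, allele_bins, min_height):
    """peaks: list of (color, size_bp, height) detected in one lane.
    allele_bins: {allele: (color, lo_bp, hi_bp)}.
    Returns {allele: (present, strongest_height)}."""
    calls = {}
    for allele, (color, lo, hi) in allele_bins.items():
        heights = [h for c, s, h in peaks if c == color and lo <= s <= hi]
        strongest = max(heights, default=0.0)
        # Size and color establish the allele; strength decides presence.
        calls[allele] = (strongest >= min_height, strongest)
    return calls

# Hypothetical SNP whose two alleles share a size bin but differ in color.
allele_bins = {"snp1_A": ("blue", 49.0, 51.0), "snp1_G": ("green", 49.0, 51.0)}
peaks = [("blue", 50.2, 1200.0), ("green", 50.1, 40.0)]
calls = call_snp_alleles(peaks, allele_bins, min_height=100.0)
```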

In another alternative preferred embodiment,
differential display data are generated, preferably as follows:
Making cDNA. The mRNA differential display reverse
transcription-polymerase chain reaction (DDRT-PCR) is performed
using reagents supplied in an RNAimage™ kit (GeneHunter Corp,
Nashville, TN). RNA duplicates from two tissue samples are
reverse-transcribed using oligo(dT) primers and MMLV reverse
transcriptase (RTase). For a 20 ul reaction, add (in order)
9.4 ul of H2O, 4 ul of 5x reaction buffer, 1.6 ul of 250 uM each
dNTPs, 2 ul of 2 uM one oligo(dT) primer (H-T11M: M is G, C, or
A), and 2 ul of 0.1 ug/ul DNA-free total RNA. The RT reactions
are run on a thermocycler (e.g., MJ Research, 65°C for 5 min, 37°C
for 60 min, and 75°C for 5 min). MMLV RTase (1 ul, 100 units) is
added to the reaction after incubation for 10 min at 37°C. A
control is included without adding RTase.

Amplification and labeling. For a 20 ul PCR reaction, add 9.2
ul of H2O, 4 ul of 10x PCR buffer, 1.6 ul of 25 uM each dNTPs, 2 ul
of 2 uM H-T11M oligo(dT) primer, 2 ul of 2 uM of arbitrary primer
(AP), 2 ul of corresponding H-T11M reverse transcribed cDNA, and
0.2 ul (1 unit) of AmpliTaq DNA polymerase (Perkin Elmer, Norwalk,
CT). Each of the three different H-T11M PCR primers is labeled
with its own spectrally distinct fluorescent dye, such as FAM,
HEX, and NED. The cDNAs are randomly amplified by low-stringency
PCR (40 cycles with temperature at 94°C for 15 sec, 40°C for 2 min,
and 72°C for 2 min) in an MJR PTC/100 thermocycler. A final
extension is performed at 72°C for 10 min. Samples (without added
RTase or cDNA) are simultaneously tested as controls. Multiple
primer sets can be used. For example, 24 sets of primers (8 AP x
3 H-T11M) are used in each kit; using 10 kits for screening
differentially expressed cDNA tags produces 240 reactions per
tissue.

Size separation. Size standards in another dye are then added
to the amplified labeled products, which are then size separated
on a manual or automated sequencing gel (or capillary) instrument.
Differential display data generation protocols have been well
described (NJ Dracopoli, JL Haines, BR Korf, CC Morton, CE
Seidman, JG Seidman, DT Moir, and D Smith, ed., Current Protocols
in Human Genetics. New York: John Wiley and Sons, 1995),
incorporated by reference.



There are other alternative preferred embodiments for
generating DNA fragment data whose assay includes a size
separation, such as Amplification Refractory Mutation System
(ARMS), Single-Strand Conformation Polymorphism (SSCP), Restriction
Fragment Length Polymorphism (RFLP), and Amplified Fragment
Length Polymorphism (AFLP). These have been enumerated, with
associated protocols (NJ Dracopoli, JL Haines, BR Korf, CC Morton,
CE Seidman, JG Seidman, DT Moir, and D Smith, ed., Current
Protocols in Human Genetics. New York: John Wiley and Sons, 1995;
ABI/377 and ABI/310 GeneScan Software and Operation Manuals, PE
Biosystems, Foster City, CA), incorporated by reference.

Extracting Profiles

Once DNA fragment sizing data have been generated, the
data are then analyzed to characterize the size and quantity of the
component fragments.

Referring to Figure 1, Step 1 is for acquiring the data.

The process begins by reading in the generated data from
their native file formats, as defined by the DNA sequencer
manufacturer. Let n be the number of lanes or capillaries, and m
be the number of frequency acquisition channels. Capillary
machines typically produce files where each file represents either
all m channels of one capillary, or one channel of one capillary.
Gel-based instruments typically produce files where one file
represents either all m channels of the entire image, or one
channel of one image. Intermediate cases, with the number of
channels per file between 1 and m, can occur as well.



Once the signals have been read in, capillary input data
signals are preferably stored in an n x m structure of
one-dimensional arrays in the memory of a computing device. This
structure contains the signal profiles, with each array element
corresponding to one channel of one capillary. Gel data are
preferably stored as m two-dimensional data arrays, one for each
acquisition frequency.

The computer software preferably integrates with current
sequencer and CE technology. It preferably has two
manufacturer-independent input modules: one for sequencer gel data
(e.g., PE Biosystems ABI/377, Molecular Dynamics Fluorimager), and
one for CE data (e.g., PE Biosystems ABI/310, SpectruMedix
SCE/9600). These modules are extensible and flexible, and
preferably handle any known sequencer or CE data in current (or
future) file formats.

Referring to Figure 1, Step 2 is for processing the signal.

In this step, basic signal processing is done, such as
baseline removal or filtering (e.g., smoothing) the data.

In the preferred embodiment for one-dimensional signals,
baseline removal is done using a sliding window technique. Within
each window (of, say, 10 to 250 pixels, depending on the average
number of pixel samples per base pair unit), a minimum value is
identified. Using overlapping windows, a cubic spline is fit
through the minimum points, creating a new baseline function that
describes the local minima in each window neighborhood. To remove
the baseline, this new baseline function is subtracted from
the original function.
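
The sliding window baseline removal can be sketched as below. This is a simplified illustration, not the preferred embodiment itself: to stay within the standard library, it joins the window minima by straight lines rather than by the cubic spline described above, and it holds the first and last minimum values flat out to the ends of the signal. The function name and the window size are assumptions.

```python
def remove_baseline(signal, window=50):
    """Subtract a baseline drawn through the minima of overlapping windows.
    Linear interpolation stands in here for the cubic spline fit."""
    n = len(signal)
    if n == 0:
        return []
    step = max(1, window // 2)  # half-window stride makes the windows overlap
    knots = []                  # (index, value) of each window minimum
    for start in range(0, n, step):
        stop = min(start + window, n)
        i_min = min(range(start, stop), key=lambda i: signal[i])
        if not knots or i_min > knots[-1][0]:
            knots.append((i_min, signal[i_min]))
    # Hold the first and last minimum values flat out to the signal ends.
    knots = [(0, knots[0][1])] + knots + [(n - 1, knots[-1][1])]
    baseline = [0.0] * n
    for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
        for x in range(x0, x1 + 1):
            t = (x - x0) / (x1 - x0) if x1 > x0 else 0.0
            baseline[x] = y0 + t * (y1 - y0)
    return [s - b for s, b in zip(signal, baseline)]

# Hypothetical trace: a slowly rising baseline with one peak at index 100.
signal = [10.0 + 0.1 * i for i in range(200)]
for offset, extra in ((-2, 10.0), (-1, 30.0), (0, 50.0), (1, 30.0), (2, 10.0)):
    signal[100 + offset] += extra
corrected = remove_baseline(signal, window=40)
```

After correction, samples away from the peak sit near zero while the peak keeps its height above the local baseline.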



In the preferred embodiment for two-dimensional images,
the baseline is removed as follows. A local neighborhood
overlapping tiling is imposed on the image, and minimum values
are identified. A baseline surface is created from these local
minima, and this baseline surface is subtracted from the image to
remove the baseline.

In the preferred embodiment, filtering and other
smoothing is done using convolution. A convolution kernel (such
as a gaussian or a binomial function) is applied across the
one-dimensional capillary signal, or the two-dimensional scanned
gel image. The radius of smoothing depends on the number of pixels
per base pair unit; denser sampling requires less smoothing.
However, with overly dense sampling, the data size can be reduced
by filtering out redundant points.
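
Smoothing a one-dimensional signal by convolution with a binomial kernel can be sketched as follows. The function names, the default radius, and the choice to clamp indices at the signal edges are assumptions for illustration.

```python
from math import comb

def binomial_kernel(radius):
    """Normalized binomial weights of length 2*radius + 1."""
    order = 2 * radius
    weights = [comb(order, k) for k in range(order + 1)]
    total = float(sum(weights))  # equals 2**order
    return [w / total for w in weights]

def smooth(signal, radius=2):
    """Convolve the signal with a binomial kernel, clamping at the edges."""
    kernel = binomial_kernel(radius)
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp index into range
            acc += w * signal[j]
        out.append(acc)
    return out

# A single spike spreads into the kernel shape: 16 * [1, 4, 6, 4, 1] / 16.
smoothed = smooth([0, 0, 0, 16, 0, 0, 0], radius=2)
```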

Referring to Figure 1, Step 3 is for separating the colors.

A key element of fluorescent genetic analysis is
separating the fluorescent dye signals. In the current art, this
is done by:

(1) Performing a dye standard calibration experiment using
known dyes, often in separate lanes. A (dye color vs.
frequency detection) classification matrix C is known
directly from which dye is used. Each column of C contains
a "1" in the row corresponding to the known color, with all
other entries set to "0".

(2) Measuring the signals at separate fluorescent detection
frequencies. These frequencies correspond to the "filter
set" of the sequencer. For each pure dye peak, the signals
that are measured across all the detection frequencies
reveal the "bleedthrough" pattern of one dye color into its
neighboring frequencies. Each pattern is normalized, and
stored as a column in a data matrix D. The jth column of D
is the "spectral bleedthrough" system response to the
impulse input function represented in the jth column of C.

(3) The relationship between C and D is described by the
linear response dye calibration matrix M as:

    D = M × C

To determine the unknown dye calibration matrix M (or its
inverse matrix M^{-1}) from the data, apply matrix division,
e.g., using singular value decomposition (SVD), to the
matrices C and D. For example, this SVD operation is
built into the MATLAB programming language (The MathWorks,
Inc., Natick, MA), and has been described (WH Press, SA
Teukolsky, WT Vetterling, and BP Flannery, Numerical
Recipes in C: The Art of Scientific Computing, Second
Edition. Cambridge: Cambridge University Press, 1992),
incorporated by reference.

(4) Thereafter, the spectral overlap is deconvolved on new
unknown data D' to recover the original dye colors C'.
This is done by computing:

    C' = M^{-1} × D'
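Steps (1) through (4) can be sketched in Python as below. This is a simplified illustration under several assumptions: the number of detection channels equals the number of dyes (so that M is square), all function names are hypothetical, and a plain Gauss-Jordan inverse is used in place of the SVD-based matrix division named in the text. When every column of C is a 0/1 indicator, the least-squares solution of D = M × C reduces to averaging the columns of D that share a color, which is what `estimate_matrix` does.

```python
def estimate_matrix(spectra, labels, n_colors):
    """Least-squares M in D = M x C when each column of C is a 0/1 indicator.

    spectra: columns of D (one normalized spectrum per calibration peak)
    labels:  color index of each column (the row holding the "1" in C)
    """
    m = len(spectra[0])
    sums = [[0.0] * n_colors for _ in range(m)]
    counts = [0] * n_colors
    for spectrum, k in zip(spectra, labels):
        counts[k] += 1
        for i in range(m):
            sums[i][k] += spectrum[i]
    return [[sums[i][k] / counts[k] for k in range(n_colors)] for i in range(m)]


def invert(a):
    """Gauss-Jordan inverse with partial pivoting (square matrices only)."""
    n = len(a)
    aug = [[float(x) for x in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        if abs(aug[p][c]) < 1e-12:
            raise ValueError("calibration matrix is singular")
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [x / pivot for x in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def separate(m_inv, measured):
    """C' = M^-1 x D' for one measured column D'."""
    return [sum(w * x for w, x in zip(row, measured)) for row in m_inv]
```

With two dyes whose normalized spectra are (1, 0.25) and (0.5, 1), a measured column (3.5, 3.5) separates into dye amounts of approximately (2, 3).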



While the current art enables the color separation step,
it is not ideal. Slight variations in the gel (e.g., thickness,
composition, temperature, chemistry) and the detection unit (e.g.,
laser, CCD, optics) can contribute to larger variations in
fluorescent response. An operator may encounter the following
problems:

• Calibrating the correction matrix M^{-1} on one gel run does
not necessarily model the "spectral bleedthrough" pattern
accurately on future runs.

• With 96 (or other multi-) capillary electrophoresis, each
of the capillaries forms its own gel system, whose
different properties may necessitate a separate calibration
matrix.






• Accurate dye calibration is technically demanding, labor
intensive, time consuming, and expensive. Moreover, it can
introduce considerable error into the system, particularly
when the manual procedure is not carried out correctly. In
such cases, the correction is imperfect and artifacts can
enter the system.

Such color "bleedthrough" (also termed "crosstalk" or "pullup")
artifacts can severely compromise the utility of the acquired
data. In some cases, the gels must be rerun. Often scientific
personnel waste considerable time examining highly uninformative
data.



In one preferred embodiment, the color matrix is 
calibrated directly from the data, without recourse to separate 
calibration runs. This bleedthrough artifact removal is done 
using computer algorithms for the calibration, rather than 
manually conducting additional calibration experiments. This can 
be done using methods developed for DNA sequence analysis based on 
general data clustering. While such clustering methods require 
relatively large amounts of data, they can be effective (W Huang, 
Z Yin, D Fuhrmann, D States, and L Thomas Jr, "A method to 
determine the filter matrix in four-dye fluorescence-based DNA 
sequencing," Electrophoresis, vol. 18, no. 1, pp. 23-5, 1997; Z
Yin, J Severin, M Giddings, W Huang, M Westphall, and L Smith,
"Automatic matrix determination in four dye fluorescence-based DNA 
sequencing," Electrophoresis, vol. 17, no. 6, pp. 1143-50, 1996), 
incorporated by reference. 



It would be desirable to use the least amount of the most
certain data when determining a color matrix for separating dye
colors. Note that in the matrix relation "D = M × C", the data
columns D are known from experiment. When a calibration run is
used, the 0/1 classification columns C are also known; from these
known values, the unknown M can be computed. However, when there
is no calibration run, the classification matrix C must be
dynamically determined from the data in order to compute M. This
can be done manually by a user identifying peaks, or automatically
by the general clustering embodiment. However, there is a more
refined and novel approach to automatically classifying certain
data peaks to their correct color.



The most preferred embodiment finds a {C, D} matrix pair 
using minimal data. Thus, M (and M^{-1}) can be determined, and the
rest of the new data color separated. This embodiment exploits a 
key physical fact about spectral emission curves: they are 
unimodal. Referring to Figure 2, such curves monotonically rise 
up to a maximum value, and then monotonically decrease from that 
maximum. Therefore, when a peak is generated from a single dye 
color, spectrally sampled across a range of frequencies, and the 
intensities plotted as a function of frequency, the plotted curve 
demonstrates unimodal behavior - the curve rises up to a maximum 
intensity, and then decreases again. This physical "unimodality" 
constraint on single color peaks can serve as a useful choice 
function for automatically (and intelligently) choosing peak data 
for the classification matrix C. 

The function space equivalent of a unimodal function has
a very useful property. Suppose that m different frequency
channels are sampled. That is, there are m frequencies x = {x_1,
x_2, ..., x_m} sampled in the spectral domain. An equivalent
representation of the m-point function f: m → R is as an m-vector
v = <x_1, x_2, ..., x_m> in the vector space R^m. Now, select an
appropriate norm on this vector space (say, L1 or L2), and
normalize the vector v relative to its length, forming

    w = v / ||v||.

In an L1 normalization, w lives on a flat simplex (in the
all-positive simplex facet). In L2, w lives on a corresponding
curved surface.

The unimodality constraint on x imposes additional 
geometrical constraints on w. Because x is unimodal, note that 

    x_1 < x_2 < ... < x_k,

and that

    x_k > x_{k+1} > ... > x_m,

where k is the index of the maximum value of x. 

These inequality constraints determine the exact
subfacet in which w must reside on the simplex facet. Consider
the case of three spectral sampling points, where the
three-dimensional vector <x, y, z> ∈ R^3, and suppose that the
unimodal constraint is x>y>z, corresponding to the first dye color.
Referring to Figure 3, the locations x, y, and z designate the
unit locations on the axes in R^3 (i.e., at <1,0,0>, <0,1,0>, and
<0,0,1>, respectively). Here, the x>y constraint produces one
region (horizontal shading), the y>z constraint another (vertical
shading), and their intersection corresponds to the full
constraint (crosshatch shading). Good calibration points for use
as columns in a {C, D} matrix pair will cluster (white circle)
around the dye's actual sampling point ratio vector (black cross).
Conversely, poor candidate calibration points will tend to lie
outside the cluster, and can be rejected. This geometry in the
function space permits selection of only the most consistent
calibration data.
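The normalization and the unimodality inequalities can be checked with a few lines of Python. This is only a sketch; the function names are hypothetical, and the strict inequalities follow the constraints as written above.

```python
def l1_normalize(v):
    """Scale v so that the absolute values of its entries sum to one."""
    total = float(sum(abs(x) for x in v))
    return [x / total for x in v]


def is_unimodal(v):
    """True if v rises strictly to a single maximum and then falls strictly."""
    k = max(range(len(v)), key=lambda i: v[i])  # index of the maximum
    rising = all(v[i] < v[i + 1] for i in range(k))
    falling = all(v[i] > v[i + 1] for i in range(k, len(v) - 1))
    return rising and falling
```

Normalizing by a positive constant does not change the ordering of the entries, so a unimodal spectrum stays unimodal after normalization; this is why the constraint carries over from v to w.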






The procedure starts by gathering likely unimodal data
(e.g., those with highest intensities) for a given observed color
k. After normalization, these candidate calibration data {w_i}
will cluster within the same subfacet of a flat (m-1)-dimensional
facet; this facet is the all-positive face of the m-dimensional
simplex. This geometric constraint follows directly from the
physical unimodality constraint of pure spectral curves. An
entire class of simple and effective clustering algorithms is
based on exploiting this geometric constraint.

In one preferred embodiment, choose as cluster center
point w_0 the mean vector location of {w_i}; vector w_0 also lies on
the simplex subfacet (Figure 3). Then, a small inner product
<w_i, w_0> value tends to indicate close proximity of w_i and w_0.
Taking a small set (e.g., 1<s<20) of the closest vectors w_i near
w_0 selects good calibration data points, all pre-classified to
color k. To determine M^{-1}, for each color k, take at most s such
clustering points {w_i}, and use the corresponding {v_i} as columns
in D, where v_i is normalized with respect to the maximum element
in w_i. Form a vector u that is all zeros, except for a 1 in the
kth entry; place s copies of u as columns in matrix C. This
produces the required calibration matrices C and D, from which M
and its inverse are immediately computed using SVD or another
matrix inversion algorithm.
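A minimal Python sketch of this cluster-center selection follows. The names are hypothetical, and two choices are assumptions made for the example: the spectra are L1-normalized, and closeness to the center w_0 is measured by squared Euclidean distance.

```python
def l1_normalize(v):
    total = float(sum(v))
    return [x / total for x in v]


def select_calibration_peaks(spectra, s=5):
    """Indices of the (at most) s spectra nearest the mean normalized spectrum."""
    ws = [l1_normalize(v) for v in spectra]
    m = len(ws[0])
    w0 = [sum(w[i] for w in ws) / len(ws) for i in range(m)]  # cluster center

    def sq_dist(w):
        # assumed proximity measure: squared Euclidean distance to w0
        return sum((a - b) ** 2 for a, b in zip(w, w0))

    ranked = sorted(range(len(ws)), key=lambda i: sq_dist(ws[i]))
    return ranked[:s]


def build_calibration_columns(spectra, chosen, k, n_colors):
    """Columns of D (max-normalized spectra) and of C (unit vectors) for color k."""
    d_cols = [[x / max(spectra[i]) for x in spectra[i]] for i in chosen]
    u = [1 if j == k else 0 for j in range(n_colors)]
    c_cols = [list(u) for _ in chosen]
    return d_cols, c_cols
```

Repeating this for every color k and concatenating the resulting columns gives the matrices C and D from which M and its inverse are computed.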

In the most preferred color separation algorithm, the 
criterion for peak selection is based on minimizing the spectral 
width of the peaks. A good measure of peak width is the variance 
across the spectral sampling frequency points. The variance 
calculation can take into account the sampling frequencies 
actually used, if desired. In this procedure, the variances of 
candidate peaks for a given color frequency are computed. Those 
peaks having the smallest spectral variance indicate the best
calibration points. In an alternative implementation, a best fit
of the observed frequency curve with a known frequency curve 
(e.g., via a correlation or inner product maximization) can 
indicate the best points to use. 
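The variance criterion might be sketched in Python as follows. The reading of "variance across the spectral sampling frequency points" as an intensity-weighted variance of the channel positions is an assumption of this example, as are the function names; by default the channels are taken to be equally spaced.

```python
def spectral_variance(intensities, frequencies=None):
    """Intensity-weighted variance of the sampling positions for one peak."""
    if frequencies is None:
        frequencies = list(range(len(intensities)))  # assumed equally spaced
    total = float(sum(intensities))
    mean = sum(f * a for f, a in zip(frequencies, intensities)) / total
    return sum(a * (f - mean) ** 2
               for f, a in zip(frequencies, intensities)) / total


def narrowest_peaks(candidates, s=2, frequencies=None):
    """Indices of the s candidate peaks with the smallest spectral variance."""
    ranked = sorted(range(len(candidates)),
                    key=lambda i: spectral_variance(candidates[i], frequencies))
    return ranked[:s]
```

A peak whose signal is confined to a single channel has zero variance and ranks first, while a peak that spreads evenly over all channels ranks last.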

The method applies not only to raw data, but also to 
data that has been previously color separated by other (possibly 
inaccurate) color correction matrices. This is because the 
unimodality constraint generally applies to such data. 

A useful feature of the unimodality approach is that the 
model can automatically select good calibration peaks, even with 
very sparse data. This is because the function space geometry 
effectively constrains the clustering geometry. Such sparse-data 
clustering algorithms are particularly useful with capillary data, 
where one capillary may only have 1 or 2 useful calibration peaks 
corresponding to a given color. Despite this data limitation, the 
method easily finds these peaks, and effectively separates the 
colors. Such automation enables customization of the dye 
separation matrix to each capillary on each run, which virtually 
eliminates the bleedthrough artifact. 



It is useful to have a quality score that measures how
well the computed matrix correction actually corrects the data.
This can be done by comparing the expected vs. observed results;
this comparison can be computed in either the separated or
unseparated domain. Using the computer-selected calibration data
vectors D, if the computed correction matrix M^{-1} is correct, then
the 0/1 calibration matrix of column vectors C will be recomputed
exactly from the data matrix D:

    C' = M^{-1} × D

where, theoretically, C' = C. Measuring the deviation between C
and C' measures how well the correction worked. One






straightforward deviation measure sums the normed L2 deviations
between each column of C and its corresponding column in C'. When
the deviation is too great (e.g., due to failed PCR
amplification), a matrix based on a more confident (manual or
automatic) calibration can be used.
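One way to compute such a deviation score is sketched below in Python. The function name is hypothetical, and the use of the Euclidean norm for each column deviation is an assumption of the example.

```python
from math import sqrt


def correction_deviation(m_inv, d_cols, c_cols):
    """Sum over calibration columns of the norm of (M^-1 x d) - c."""
    total = 0.0
    for d, c in zip(d_cols, c_cols):
        # recompute the classification column C' from the data column
        c_prime = [sum(w * x for w, x in zip(row, d)) for row in m_inv]
        # assumed norm: Euclidean distance between C' and C
        total += sqrt(sum((a - b) ** 2 for a, b in zip(c_prime, c)))
    return total
```

A score near zero indicates that the correction matrix reproduces the 0/1 classification columns; a threshold above which the matrix is rejected would have to be chosen empirically.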

Referring to Figure 1, Step 4 is for removing the primers. 

When the separated DNA fragments are formed from labeled
PCR primers, it is useful to remove the intense primer signal prior
to initiating quantitative analyses. There are many standard
signal processing methods for removing a singularity from a
signal, or a row of such large peaks from an image.

For one dimensional data, first detect the primer
signal. This can be done, in one embodiment, by smoothing (i.e.,
low pass filtering) the signal to focus on broad variations, and
then fitting the largest peak (i.e., the primer peak) to an
appropriate function, such as a Gaussian. Determining the
variance of the peak from the function fit provides the domain
interval on which the primer peak should be removed. It is most
preferable to set the values on this interval to an appropriate
background value (e.g., zero, if the background has been
subtracted, or an average of neighboring values outside the
interval). Alternatively, one can crop the interval from the
signal.
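The one dimensional primer removal can be sketched in Python as below. Several simplifications are assumptions of this example: the input is taken to be already smoothed, the peak width is estimated from the second moment around the largest peak instead of an explicit Gaussian fit, and the names and default parameters (`n_sigma`, `half_window`, `background`) are hypothetical.

```python
from math import sqrt


def remove_primer_peak(signal, n_sigma=3.0, half_window=50, background=0.0):
    """Replace the largest peak and its neighborhood with a background value."""
    peak = max(range(len(signal)), key=lambda i: signal[i])
    lo = max(0, peak - half_window)
    hi = min(len(signal), peak + half_window + 1)
    positions = range(lo, hi)
    weights = [max(signal[i], 0.0) for i in positions]
    total = sum(weights)
    # moment estimates of the peak center and variance (stand-in for a fit)
    center = sum(i * w for i, w in zip(positions, weights)) / total
    variance = sum(w * (i - center) ** 2
                   for i, w in zip(positions, weights)) / total
    half_width = n_sigma * sqrt(variance)
    a = max(0, int(center - half_width))
    b = min(len(signal), int(center + half_width) + 1)
    cleaned = list(signal)
    for i in range(a, b):
        cleaned[i] = background
    return cleaned, (a, b)
```

The function returns both the cleaned signal and the interval that was blanked, so that a caller could crop the interval instead of overwriting it.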

For two dimensional data, a projection of the pixel data
onto the vertical axis of DNA separation finds the row of peaks.
Determine the spread of this peak signal by curve fitting (e.g.,
with a Gaussian). Remove the primer peaks by either cropping the
signal from the image, or setting the values in that domain to
zero (or some other appropriate background value).






Referring to Figure 1, Step 5 is for tracking the sizes.

In the preferred embodiment, size standards are run in 
the same lane as the sample data, and are labeled with a label 
different from the label used for the sample data. The task is to 
find these size standards peaks, confirm which peaks represent 
good size standard data, and then align the observed size 
standards peaks with the expected sizes of the size standards. 
This process creates a mapping between the size standard peak data 
sampled in the pixel domain, and the known sizes (say, in base 
pair or molecular weight units) of each peak. 

For one dimensional data, there are no lateral lanes to 
help determine which of the peaks observed in the size standard 
signal represent good data. Therefore, a preferred procedure uses 
prior information about the size standards (e.g., the size) to 
ensure a proper matching of data peaks to known sizes. In the 
most preferred embodiment, use the following steps: 

0. Find some good candidate peaks to get started. 

1. Identify the best peaks to use by filtering poor candidates. 
This can be done by performing quality checks (e.g., for the 
height, width, or peak fit) on the candidate peaks. 

2. Match the expected peak locations to the observed data peaks. 
This is done by applying a "zipper match" algorithm to the 
best candidate data peaks and the expected sizes. This 
matching uses local extension to align the peaks with sizes, 
and includes the following steps: 

2a. Match the boundary, that is, a subset of smallest size
data peaks or largest size data peaks. The algorithm can
be rerun several times, shifting the boundary data peaks or
the expected sizes. The best boundary shift can be found
by heuristic minimization. One good heuristic assesses the
uniformity (e.g., by minimizing the variance) of the ratios
across matching local intervals of the expected size
standard difference to the observed data peak difference.

2b. Fix a tolerance interval that includes unity (different
tolerance intervals can be tried).

2c. Starting with the (possibly truncated) boundaries of the
expected sizes and observed peaks aligned, extend this
boundary.

2d. Compute the ratio p of the difference between the next 
expected peak size and the current expected peak size, to 
the difference between the next observed peak size and the 
current observed peak size. 

2e. Compute the ratio q of the difference between the current 
expected peak size and the previous expected peak size, to 
the difference between the current observed peak size and 
the previous observed peak size. 

2f. If the ratio (p/q) is greater than the tolerance 

interval's greatest value, then the observed data peak 
falls short. Reject the observed data peak, advance to the 
next observed data peak, and continue. 

2g. If the ratio (p/q) is less than the tolerance interval's 
least value, then the expected data size falls short. 
Reject the expected data size, advance to the next expected 
data size, and continue. 

2h. If the ratio (p/q) lies within the tolerance interval, 
then the expected data size is well matched to the observed 
data peak. Accept the size and the peak, record their 
match, and advance to the next size and peak. 

3. If desired, fill in any missing data peaks by interpolation
with expected sizes from the zipper matching result.

4. Compute a quality score for the matching result. Useful
scores include:

4a. Fitting the matching of the expected sizes (in base pair,
or other expected unit) with the peak sizes (in pixels, or
other observed data unit). The relationship is monotonic
and typically slowly varying, so a deviation from a fitted
function (e.g., linear, cubic, or Southern mobility
relationship) works well.

4b. Another quality score is the number of size mismatches,
adjusted by the boundary shift.

5. Report the peak positions, matched with size standards. It
is also useful to include the quality score of the match.
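The local extension in steps 2c through 2h can be sketched in Python as follows. This is an illustration under stated assumptions: the function name and the default tolerance interval are hypothetical, both lists are strictly increasing, and the first two expected sizes are taken to be already aligned with the first two observed peaks (the boundary match of step 2a, whose shift search is not shown).

```python
def zipper_match(expected, observed, tolerance=(0.8, 1.25)):
    """Pair expected sizes with observed peak positions by local extension.

    Returns a list of (expected_index, observed_index) matches.
    """
    low, high = tolerance
    matches = [(0, 0), (1, 1)]          # assumed pre-aligned boundary
    e_prev, e_cur, o_prev, o_cur = 0, 1, 0, 1
    e_next, o_next = 2, 2
    while e_next < len(expected) and o_next < len(observed):
        # step 2d: ratio over the next interval
        p = ((expected[e_next] - expected[e_cur])
             / (observed[o_next] - observed[o_cur]))
        # step 2e: ratio over the current interval
        q = ((expected[e_cur] - expected[e_prev])
             / (observed[o_cur] - observed[o_prev]))
        ratio = p / q
        if ratio > high:
            o_next += 1                 # step 2f: observed peak falls short
        elif ratio < low:
            e_next += 1                 # step 2g: expected size falls short
        else:
            matches.append((e_next, o_next))   # step 2h: accept the pair
            e_prev, e_cur = e_cur, e_next
            o_prev, o_cur = o_cur, o_next
            e_next += 1
            o_next += 1
    return matches
```

With expected sizes of 50, 100, 150, 200, and 250 and observed peaks at 500, 1000, 1200, 1500, 2000, and 2500 pixels, the spurious peak at 1200 is rejected in step 2f and the remaining peaks are matched in order.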

For two dimensional gel data, proceed by tracking on the
color-separated size standard image. To track simultaneously both
the sizes and lanes, first focus on a boundary row (e.g., the top
row) of size standards. Model the row geometry (e.g., curvature,
size, location) and each of the peaks (e.g., height, shape,
angle). Then use this pattern as a set of geometrical constraints
for analyzing the next row. Use global search to find the next
row, and more localized search to identify the peaks; refine the
local peak locations using center-of-mass calculations. Continue
this process iteratively until the size standard data are
completely analyzed, and a final grid is produced. It is possible
to use a prior analysis of the loading/run pattern as calibration
data in order to speed up subsequent tracking. The output of this
lane/size tracking is a mapping between the expected (lane, base
pair size) coordinate grid, and the observed (x pixel, y pixel)
gel data image coordinate grid.

The quality of the tracking result can be scored by 
comparing the expected grid locations with the observed data peak 
locations - straight data grids indicate a high quality result, 
whereas large distortions indicate a lower quality result. One 
useful quality score measures the curvature of the observed peaks 
by forming the sum of local neighbor distances, and normalizing 
relative to the neighbor distances in a straight idealized grid. 
This grid is preferably formed from the known lane loading
pattern, the known size standard positions, and observed grid data
(e.g., mean boundary positions).

With two dimensional gel image data f(x,y,z), z denoting
the color plane, there also is the step of extracting lane
profiles from the gel image to form a set of one dimensional
profiles. The lane tracking result is used in this step. For
each lane, form the mapping from y-pixel values to the (x,y) pixel
locations indicated by the size standard location found for the
lane. Create a spline function (e.g., cubic) that interpolates
through the mapped v: y → x pixel pairs to all other (x,y) pixel
values. Then, extract a one dimensional profile (for each color
k), where the domain values down the lane are the desired
sequential pixels y, and the range values are computed from the
function f(v(y),y,z). This procedure is done for each data image
plane z, extracting either color separated data (by applying the
correction matrix) or unseparated data.

Referring to Figure 1, Step 6 is for extracting the profiles. 

The signal profile of a capillary or lane is f(x_pixel, z),
where z denotes the label color (or channel). This function is
available for capillary data after processing Step 2,
and for gel data after extracting profiles in Step 5.

The pixel sampling of the electrophoresis distorts the
sizing of the raw data. The size tracking result from Step 5
provides a set of (x_pixel, x_size) pairs. Use these matched pairs to
form the coordinate transformation μ: x_pixel → x_size. Combine the
functions f and μ via a double interpolation to form

    f ∘ μ^{-1}(x_size, z),

a new function that describes the signal as a function of size
standard units (instead of pixel units), preferably in base pair
size units. If color separation has not yet been done, the
correction matrix is applied at this time. These transformed
(capillary or gel data) profiles are then preferably stored in an
n×k data structure of sized profiles, where n is the number of
pixel samples, and k is the number of dye colors.

Note that the f_pixel ∘ μ^{-1}(x_size, z) transformation, which
maps

    x_size → f_size(x_size),

can be usefully understood as the commutative diagram:

                 f_size
    x_size   ----------->   f_size(x_size)
      ^                          |
    μ |                          | identity
      |                          |
    x_pixel  ----------->   f_pixel(x_pixel)
                 f_pixel

The transformation is preferably implemented by
performing a double interpolation. In MATLAB, this operation can
be readily computed using the 'interp1' function where:

    YI = interp1(X,Y,XI,method)

interpolates to find YI, the interpolated values of the underlying
function Y at the points in the vector XI; the vector X specifies
the points at which the data Y is given.

With 'interp1', interpolate the desired size domain
points into computed pixels using the monotonic bijection between
expected sizes and observed peak pixels:

    % map the desired size points to (fractional) pixel positions
    pixels = interp1(sizes, peaks, domain, 'spline')

Cubic spline interpolation is done by setting the method to
'spline'. Then, interpolate the function f on the desired size
domain points (i.e., the computed pixels) using the monotonic
bijection f(i) between indices {i} and data values {f(i)}:

    % sample the pixel-domain data at the computed pixel positions
    indices = [1:length(data)];
    profile = interp1(indices, data, pixels, 'spline');

Computing profile completes the commutative diagram, and produces
the new function f_size(x_size).



Referring to Figure 4, visualizations are shown for the 
results of processing Steps 1-6. 

The computer software preferably extracts and analyzes
data automatically, without human intervention (if so desired).
The software separates colors using a matrix, which is either
precalibrated or created from the data, depending on which module
is used. The software tracks lanes and size standards on
two-dimensional gel data automatically by mapping the expected two
dimensional lane/size grid to the observed size standard data; on
one-dimensional CE data, size tracking is done separately for each
capillary. The user interface software makes manual retracking,
zooming, and single-click access to the chemistry panel (or sample
data used on a run) available to the user throughout.

A chain of custody is maintained in that the user 
preferably cannot move to the next step without saving results 
(manually or automatically) from the current step. Changes are 
saved, not discarded; moreover, the software records these changes 
incrementally, so that the audit trail cannot be lost by early 
program termination. 



Backtracking capability and flexibility are preferably
included in the software. For example, should fully automated
lane tracking fail due to low-quality data, the user can choose
to: (a) edit the results, and have the program re-track, (b) edit
the results without re-tracking, (c) manually track the lanes and
sizes, or (d) reject the low-quality gel. A "revert" operation
provides a universal Undo operation for automatically rolling back
major processing steps.

Data Scoring

Referring to Figure 5, Step 7 is for deriving an allelic ladder. 

It is useful to have a set of reference peaks that (a)
correspond to the actual locations of DNA molecules on the gel,
(b) have known lengths (in base pair units), and (c) cover a large
part of the sizing window. This reference set can be developed
for any fragment sizing genetic assay; without loss of generality,
the preferred embodiment is described for STR genotyping.

Construct a partial allelic ladder by PCR amplifying a
pool of DNA samples. This allelic ladder, and optionally a known
sample, are preferably loaded into the electrophoretic system as
separate signals (i.e., in different lanes or colors). After the
gel is run, referring to Figure 1, Steps 1-6 are performed to
initially process the signals. Then, preferably in size
coordinates (rather than in pixel coordinates), a peak finding
algorithm locates and sizes the peaks in these two lanes with
respect to their in-lane size standards. The known sample's sized
peaks are then compared with the similarly sized peaks found in
the ladder lane. Following this comparison, the known DNA lengths
are then assigned to the allelic ladder. These DNA length labels
are then propagated to the unlabeled ladder peaks.

A preferred peak labeling procedure is:



1. In the ladder signal, find the domain positions
(i.e., the sizes relative to the size standards) of the allelic
peaks.

2. Perform a relational labeling from the known signal
to the allelic ladder signal, as follows. Find the peaks of the
known signal, and assign to them their known sizes (in allelic
size units, such as base pairs). Then, match (in size standard
units) these known allele peaks to the peaks in the ladder signal
having corresponding size. Designate these ladder peaks with the
allelic size labels of the known peaks.

3. Extend these allelic label assignments from the
allelic ladder peaks designated with known labels to the rest of
the ladder peaks.

3a. When the expected size pattern of the alleles on the
allelic ladder is known (as with previously characterized forensic
ladders), a robust method for assigning size labels to data peaks
uses a standard open/closed search set artificial intelligence
(AI) algorithm. Start from the most confident data (i.e., the
knowns) as the closed set, with all remaining peaks in the open
set. Each cycle, (i) select an open ladder allele that is nearest
in allele size to a closed ladder allele, (ii) predict the
calibrated size (from the size standards) of this allele using
interpolation relative to the closed peaks, (iii) find the allelic
ladder peak whose calibrated size is closest to the predicted size
in (ii), and (iv) if the quality score (based on size deviation)
is good enough, move this candidate open peak to the closed set.
On termination, only the most confident ladder peaks in the
observed data have been matched to expected ladder alleles. See
(NJ Nilsson, Principles of Artificial Intelligence. Palo Alto, CA:
Tioga Publishing Co., 1980), incorporated by reference.

3b. When expected ladder alleles are not available
(e.g., with uncharacterized pooled DNA samples), the ladder
pattern alone (peak spacing or peak heights) usually contains
sufficient information for designating the labeling. Start with
the most confident data - known allele peaks, or tallest ladder
peaks. Then, locally extend the allele labels to the size peaks.
Since there is more uncertainty without a known ladder pattern,
more search is useful. One preferred method is to (i) find the
tallest (most confident) peak, (ii) match size with nearest (bp
size) allele, (iii) locally align iteratively for smaller and
larger peaks, (iv) assign a quality score to the alignment, (v)
repeat the preceding three steps, but shifted one or two sizes up
or down, and (vi) select the alignment with the highest quality
score.

4. Optionally, fill in any missing ladder peaks with
interpolation by reference to past ladder data.

5. Fill in the virtual points on the ladder. These are
expected ladder alleles that are believed to exist in the
population, but that are not represented as peaks in the allelic
ladder. This is done by interpolation.

6. Return the results assigning allele labels (e.g., in
base pair or other integer units) to data ladder peaks (e.g., in
size standard units). It is preferable to report the quality
score of the assignment.
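The open/closed set extension of step 3a can be sketched in Python as follows. This is an illustration under stated assumptions: the names and the `max_dev` threshold are hypothetical, at least two known allele peaks seed the closed set, the prediction in (ii) uses linear interpolation (or extrapolation) through the two closed alleles nearest in allele size, and an open allele that fails the quality test in (iv) is simply dropped.

```python
def label_ladder(expected_alleles, peak_sizes, known, max_dev=0.5):
    """Assign expected allele lengths to observed ladder peak sizes.

    expected_alleles: allele lengths (e.g., base pairs) expected on the ladder
    peak_sizes:       observed ladder peak sizes (size standard units)
    known:            dict {allele length: peak size} seeding the closed set
    Returns a dict {allele length: matched peak size}.
    """
    closed = dict(known)
    open_set = [a for a in expected_alleles if a not in closed]
    free_peaks = [p for p in peak_sizes if p not in closed.values()]
    while open_set:
        # (i) open allele nearest in allele size to any closed allele
        allele = min(open_set, key=lambda a: min(abs(a - c) for c in closed))
        open_set.remove(allele)
        # (ii) predict its size from the two nearest closed alleles
        a0, a1 = sorted(sorted(closed, key=lambda c: abs(c - allele))[:2])
        s0, s1 = closed[a0], closed[a1]
        predicted = s0 + (s1 - s0) * (allele - a0) / (a1 - a0)
        if not free_peaks:
            continue
        # (iii) unassigned ladder peak closest to the prediction
        peak = min(free_peaks, key=lambda p: abs(p - predicted))
        # (iv) accept only when the size deviation is small enough
        if abs(peak - predicted) <= max_dev:
            closed[allele] = peak
            free_peaks.remove(peak)
    return closed
```

Alleles that remain unmatched after the loop are candidates for the interpolation described in steps 4 and 5.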

In nonforensic STR applications, where previously
characterized ladders are generally unavailable, pooled alleles of
a given marker can be used as a reference ladder. This novel
approach can help eliminate the size binning problem that plagues
microsatellite and other STR genetic methods. It is preferable to
use the same allelic ladder across multiple runs. The ladder can
be comprised of either pooled DNAs that are PCR amplified
together, or post-PCR amplified products that are then pooled
together. It may be desirable to visually inspect the ladders.
In one preferred embodiment, previously uncharacterized ladders
are checked the first time that they are encountered, with a human
editor identifying the best peaks to use and matching them against
their expected sizes. Recording quantitative peak data for a
ladder can enable the use of quantitative computer-based matching
of reference ladders and new unknown ladders (e.g., for peak and
size alignment) by correlation or other inner product methods.

When the interpolation and extrapolation have finished,
the allele sizes that were actually present as peaks in the ladder
data, as well as desired allele sizes that were not present as
peaks, have all been allocated to size positions.

SNP ladders can be developed in a similar way to STR
ladders. With multiplexed SNPs, it is useful to run allelic
sizing ladders comprised of actual SNP data for the markers of
interest in a signal. Then, comparison of the unknown SNP sample
data with the SNP allelic ladder can remove uncertainty regarding
the correct size (hence, allele) assignment. The same reasoning
applies to any size-based assay for which a (partial or complete)
ladder of candidate solutions can be developed.

With gel electrophoresis, it is preferable to include
on the gel at least one allelic ladder for every marker used. For
example, one lane can be dedicated to ladders for the marker panel
used in the other lanes; this lane can be loaded in duplicate on
the gel. It is also desirable to include at least one known
reference allele lane on every gel; this signal can be one or more
positive PCR controls. The advantage of duplicate control lanes
(for ladders and reference controls) is that when there is a PCR,
loading, detection, or other failure in one lane, the signal in
the other lane can be used. Moreover, a comparison of the two (or
more) signals can suggest when such a failure may have occurred.

With capillary electrophoresis, it is similarly
preferable to use ladder and reference controls.

• With single capillary systems (such as the ABI/310), these
controls should be run at some point during the lifetime of
the capillary. Preferably, the controls should be run as
often as the temporal variation in the sizing system (i.e.,
differential sample and size standard migration) warrants.
• With a multiple capillary instrument (e.g., ABI/3700,
MegaBACE, etc.), each capillary can form its own
electrophoretic separation system. In the most preferred
embodiment, the allelic ladder, known reference samples, and
any other capillary controls (for fragment sizing, color
separation, etc.) are run at least once for every capillary,
with the calibration results then applied specifically to
that capillary. For example, consider the case of using one
panel comprised of a set of markers, with this panel applied
to a set of s samples, run out on an n-capillary instrument
that achieves r sequential runs per capillary with acceptable
sizing fidelity. Suppose (for high-throughput studies)
that, approximately, n × r < s. Then, one of the runs (e.g.,
run number 1 or r/2) should have allelic ladders in all n
capillaries, and another run (e.g., run number 2 or 1+r/2)
should have allele references in all n capillaries.

Generating data containing these allelic ladder sizing
controls, and analyzing the data as described in this step, reduce
run-to-run sizing variation. Such reduction in gels or
capillaries is crucial for achieving reproducible sizing assays.
Variable run conditions (e.g., temperature, gel consistency,
formamide concentration, sample purity, concentration, capillary
length, gel thickness, etc.) can induce differential sizing
between the PCR products in the sample and the internal size
standards. These allelic ladder comparison methods help correct
for such variation.






Referring to Figure 5, Step 8 is for transforming coordinates.

In the preferred embodiment, size comparisons used in 
the analysis are performed in the allelic ladder size coordinate 
system, rather than in the size standard coordinate system. While 
the latter approach is also workable, the former method has the 
advantages that the reference system is comprised of DNA bands 
size calibrated to actual integer DNA molecule lengths, rather 
than to arbitrary fractional molecular weight sizing units. Using
integer DNA lengths (e.g., true base pairs) is closer to the 
physical reality, and can simplify the data analysis, logical 
inference, communication with the user, quality assurance, and 
error checking. 

In one preferred embodiment, the signals are kept in 
size standard units, but comparisons are made in the allelic 
ladder frame of reference. For example, in comparing sample peaks 
with allelic ladder reference peaks, the sample peaks can first be 
interpolated into a domain based on the allelic ladder peak sizes, 
prior to comparing the sample peaks with any other reference 
sizes.

A ladder-based peak sizing method can establish a direct 
connection between observed peak sizes and actual DNA fragment 
lengths (possibly up to a constant shift). Transforming data
sizes into DNA lengths overcomes size binning problems. For
example, rounding a number to the nearest integer in DNA length
(i.e., ladder) coordinates permits the assigning of consistent
labels to each peak; this consistency is not achieved when rounding
fractional peak size estimates. Moreover, the peak's deviation
from the integer ladder provides a quality measure for how 
consistently the peaks are sizing on a particular size separation 
run. 
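
By way of illustration only, the ladder-based labeling just described can be sketched as follows. The sketch assumes piecewise-linear interpolation between adjacent ladder peaks; the function names and the ladder values are hypothetical and are not taken from the specification.

```python
from bisect import bisect_right

def to_length(size, ladder_sizes, ladder_lengths):
    """Piecewise-linear map from size-standard units to ladder DNA length (bp)."""
    # Locate the ladder interval containing `size`; sizes outside the ladder
    # are extrapolated from the nearest end interval.
    i = min(max(bisect_right(ladder_sizes, size) - 1, 0), len(ladder_sizes) - 2)
    s0, s1 = ladder_sizes[i], ladder_sizes[i + 1]
    b0, b1 = ladder_lengths[i], ladder_lengths[i + 1]
    return b0 + (size - s0) * (b1 - b0) / (s1 - s0)

def label_peak(size, ladder_sizes, ladder_lengths):
    """Return (integer length label, deviation from that integer in bp)."""
    length = to_length(size, ladder_sizes, ladder_lengths)
    label = round(length)
    return label, length - label

# Hypothetical ladder: observed sizes (size-standard units) and true lengths (bp).
ladder_sizes = [99.2, 103.1, 107.3, 111.2]
ladder_lengths = [100, 104, 108, 112]
label, deviation = label_peak(105.2, ladder_sizes, ladder_lengths)
```

The returned deviation plays the role of the quality measure mentioned above: values near zero suggest consistent sizing on that run.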




In the most preferred embodiment, the sample signals are 
transformed from size standard units into allelic ladder DNA 
length units. Step 6 provides the sized signal profile of a lane 
(or capillary):

$$f \circ \mu^{-1}(x_{\text{size}}).$$

Step 7 provides a characterized allelic ladder that matches the
alleles of the ladder in size standard units with corresponding
DNA molecule lengths in base pairs (or other suitable integer-spaced
units). This pairing defines a coordinate transformation
for that ladder:

$$\lambda : x_{\text{size}} \mapsto x_{\text{length}}.$$

Combining the function $f \circ \mu^{-1}$ together with the function $\lambda$ (this can
be done using a double interpolation) forms:

$$f \circ \mu^{-1} \circ \lambda^{-1}(x_{\text{length}}),$$

which represents the signal intensity f in terms of DNA length
$x_{\text{length}}$. The relevant mathematics and computer implementations are
detailed above in Step 6, extracting the profiles. For each
marker, with n lanes or capillaries, and k colors, these
transformed profiles can be preferably stored in an n×k data
structure (e.g., in memory, or in a file).
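
A minimal sketch of the double interpolation follows, assuming linear interpolation at both stages. The helper names are hypothetical; `sample_sizes` is taken to hold the size-standard coordinate of each scan point of the lane, with `intensities` the signal at those points.

```python
def interp(x, xs, ys):
    """Linear interpolation of y at x for increasing xs; clamps outside the range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + t * (ys[i] - ys[i - 1])

def profile_in_length(lengths, ladder_lengths, ladder_sizes, sample_sizes, intensities):
    """Evaluate the sized signal at each requested DNA length (in bp)."""
    out = []
    for x_len in lengths:
        # First interpolation: ladder mapping inverted, length -> size.
        x_size = interp(x_len, ladder_lengths, ladder_sizes)
        # Second interpolation: sized signal profile, size -> intensity.
        out.append(interp(x_size, sample_sizes, intensities))
    return out
```

Evaluating the profile on an integer (or finer) grid of lengths yields one row of the n×k structure mentioned above.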

This procedure provides a method for analyzing a nucleic
acid sample. The steps include: (a) forming labeled DNA sample
fragments from a nucleic acid sample; (b) size separating and
detecting said sample fragments to form a sample signal; (c)
forming labeled DNA ladder fragments corresponding to molecular
lengths; (d) size separating and detecting said ladder fragments
to form a ladder signal; (e) transforming the sample signal into
length coordinates using the ladder signal; which in turn permits
(f) analyzing the nucleic acid sample signal in length
coordinates, as follows.






Referring to Figure 5, Step 9 is for quantitating signal peaks.

In addition to sizing DNA peaks, it is also useful to 
quantitate the relative amount of DNA present. To do this 
accurately requires taking account of band overlap. Few systems 
currently perform this band overlap analysis; those that do (e.g.,
Cybergenetics' TrueAllele software) use a combinatorial approach
that increases analysis time greatly as the number of adjacent 
peaks increases. This combinatorial cost can impede analyses of 
large allelic ladders or of differential display data. In the 
preferred embodiment, as described herein, the DNA quantitation 
step for resolving band overlap should computationally scale 
(e.g., at linear or small polynomial cost) with the number of 
bands analyzed. 

In DNA length coordinates, as developed in Step 8, a 
peak has a natural shape that stems from band broadening as it 
migrates down the gel or capillary. (For accurate peak 
quantitation, DNA length coordinates are preferable to size
standard coordinates, and size standard coordinates are
preferable to pixel scan coordinates.) Centered at location $x_k$,
this peak shape can be described as a Normal function

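$$\phi^{-}(x) \propto e^{-\frac{1}{2}\left(\frac{x - x_k}{\sigma}\right)^{2}}$$

on the leading edge, and a Cauchy function

$$\phi^{+}(x) \propto \frac{1}{1 + \left(\frac{x - x_k}{\alpha}\right)^{2}}$$

on the receding edge; the functions share the same height h at $x_k$.
Band broadening implies that as $x_k$ increases, the width parameters
$\sigma$ and $\alpha$ will increase, the height h will decrease, while the area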
remains constant. This is confirmed by fitting peak spread data
as a function of DNA length with a third order exponential
function.
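
For illustration, one possible realization of such an asymmetric band shape is sketched below. It assumes the standard forms of the Normal and Cauchy functions, and it assumes that the leading edge corresponds to lengths below the center; the function name and parameter values are hypothetical.

```python
import math

def peak_shape(x, center, sigma, alpha, height=1.0):
    """Asymmetric band: Normal on the leading side, Cauchy on the receding side."""
    u = x - center
    if u <= 0:
        # Assumed leading edge (x at or below the center): Normal with spread sigma.
        return height * math.exp(-0.5 * (u / sigma) ** 2)
    # Receding edge: Cauchy with spread alpha; equals `height` as u approaches 0.
    return height / (1.0 + (u / alpha) ** 2)
```

Both branches take the value `height` at the center, so the two edges join continuously there.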

Using the changing band shapes as a set of basis
functions, write the trace f (in DNA length coordinates) as:

$$f(x) = \sum_{k=1}^{n} c_k \cdot \phi_k(x - x_k),$$

where basis function $\phi_k$ depends on $\langle x_k, \sigma_k, \alpha_k, h_k \rangle$, i.e., the
center position $x_k$, Normal spread $\sigma_k$, Cauchy spread $\alpha_k$, and center
height $h_k$.

From the data f(x), solving for the coefficients $\{c_k\}$
provides an estimate of the DNA concentration at each peak. Less
efficient approaches (e.g., TrueAllele) currently solve this
equation by a least-squares search of the 4n coefficients to the
data f, with combinatorial computing time proportional to exp(4n).
However, by exploiting the model functions together with
function space mathematics, this computing time can be reduced to
roughly linear cost, proportional to n. See (F Riesz and B Sz.-Nagy,
Functional Analysis. New York: Frederick Ungar Publishing
Co., 1952), incorporated by reference.

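Normalize the basis functions $\phi_k$ so that

$$\langle \phi_k(x), \phi_k(x) \rangle = 1,$$

and note that the band overlap coefficients $b_{k,j}$ can be numerically
estimated from the model functions as

$$\langle \phi_k(x), \phi_j(x) \rangle = b_{k,j}.$$

Then, with initial estimates of $x_k$ (assuming one peak
per bp size k), rewrite the inner product $\langle f, \phi_j \rangle$ as:

$$\langle f, \phi_j \rangle = \sum_{k=1}^{n} c_k \langle \phi_k, \phi_j \rangle = \sum_{k=1}^{n} c_k \cdot b_{k,j}.$$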






Setting the data derived values $\{d_j = \langle f, \phi_j \rangle\}$, and using
appropriate vector-matrix notation, yields the relation:

$$d = c \times B.$$

With sparse peak data that has little band overlap
(e.g., tetranucleotide repeat data), $B = I_n$ (the identity matrix),
and $c = d$ immediately yields the DNA concentrations at every peak
k. More generally, B is largely an identity matrix (i.e.,
primarily zeros off-diagonal), but has a few small near-diagonal
elements (the band overlaps) added in. Solve for c from the
observed data vector d and the band overlap matrix B using a very
fast matrix inversion algorithm (e.g., SVD) that exploits the
sparse nature of the local overlap coefficients.

In the most preferred embodiment, the overlap matrix B
is used to rapidly estimate the DNA concentrations $\{c_k\}$ at the
peaks. This is done by computationally exploiting the functional
analysis. Since

$$d = c \times B,$$

rewriting yields:

$$d \times B^{-1} = c \times B \times B^{-1} = c,$$

or

$$c = d \times B^{-1};$$

and so the DNA concentrations c can be estimated immediately from
the values $\{d_j = \langle f, \phi_j \rangle\}$ derived from data signal f once the
basis functions $\{\phi_k\}$ and their overlap integrals $B = \langle \phi_j, \phi_k \rangle$ are known.
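
As a hedged illustration of the linear-cost claim, the sketch below solves d = c × B under the simplifying assumption that only immediately adjacent peaks overlap, so that B is symmetric and tridiagonal with ones on the diagonal. The tridiagonal (Thomas) elimination shown here is a substitute technique chosen for brevity; it is not the SVD approach named in the text, and the function name is hypothetical.

```python
def solve_concentrations(d, overlap):
    """Solve d = c x B for c in O(n) time.

    B is assumed symmetric tridiagonal with unit diagonal, where
    overlap[k] = b[k][k+1] = b[k+1][k] is the overlap of adjacent peaks.
    """
    n = len(d)
    diag = [1.0] * n
    rhs = list(d)
    # Forward elimination of the sub-diagonal.
    for k in range(1, n):
        m = overlap[k - 1] / diag[k - 1]
        diag[k] -= m * overlap[k - 1]
        rhs[k] -= m * rhs[k - 1]
    # Back substitution.
    c = [0.0] * n
    c[-1] = rhs[-1] / diag[-1]
    for k in range(n - 2, -1, -1):
        c[k] = (rhs[k] - overlap[k] * c[k + 1]) / diag[k]
    return c
```

With all overlaps set to zero the routine returns c = d, matching the sparse case described above.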

In the preferred embodiment, estimate the basis
functions and their overlaps. Each basis function depends on
$\langle x_k, \sigma_k, \alpha_k, h_k \rangle$. To reduce this four dimensional search to one
dimension, proceed as follows.

• $x_k$ can be estimated from the peak center.

• $h_k$ is not a factor, since each $\phi_k$ is normalized to integrate
to 1.

• $\sigma_k$ and $\alpha_k$ are empirically observed to be in a fixed ratio
to one another (at least in local neighborhoods), due to
the relatively constant peak shapes.

Therefore, a computationally efficient approach is to perform a
one dimensional search for $\sigma_k$ and $\alpha_k$, keeping their ratio fixed.
Then, if so desired, perform a quick local two dimensional search
in the solution neighborhood to refine the values of $\sigma_k$ and $\alpha_k$.

To find the peak width parameters $\sigma_k$ and $\alpha_k$, first fix a
value of $\sigma$ in a local neighborhood; in 1D search, set $\alpha$ to a fixed
proportion of $\sigma$; in 2D search, set the value of $\alpha$ as well. For
each observed peak, set the center $x_k$ of $\phi_k$ to the peak's center
location. Following normalization, the basis functions $\{\phi_k\}$ are
determined. Then, for j's in the neighborhood, compute $d_j = \langle f, \phi_j \rangle$, or:

$$d = \langle f, \phi \rangle.$$

Determine B by numerical (or closed form) integration of
the basis functions $\phi_j$ and $\phi_k$:

$$B = \langle \phi_j, \phi_k \rangle.$$

Compute $B^{-1}$ by inverting B.
Note that terms far from the diagonal are essentially zero. These
local neighborhood calculations of d and $B^{-1}$ are rapid, and are
used in the inner loop of the following minimization procedure.

Minimize the difference between the observed data signal
f and the expected model function:

$$f(x) - \sum_{k=1}^{n} c_k \cdot \phi_k(x - x_k)$$

by substituting the $\{c_k\}$ computed from:

$$c = d \times B^{-1}.$$

Minimization (e.g., using an $L_2$ norm) of the expression:

$$\left\| f(x) - \sum_{k=1}^{n} \left(d \times B^{-1}\right)_k \cdot \phi_k(x - x_k) \right\|_2$$

in the neighborhood as $\sigma$ (and possibly $\alpha$) vary, finds the best
estimate of the widths of the basis function. These widths can be
modeled locally, interpolated by fitting with a cubic exponential
function, and then used to generate appropriate basis functions
across the full range of sizes.



Application of these modeled basis functions to the data
function f produces robust estimates of the DNA concentration at
each DNA peak k, since, after convergence,

$$c = d \times B^{-1}.$$

In the current art, workers use the heights or areas of the data
peaks. Because of the extensive peak shape modeling done here, it
is preferable to use the heights (i.e., $c_k$) or areas (determined
from $c_k$ and the peak shape, e.g., by closed form evaluation) of
the computationally modeled peaks.






The match-based quantitation described herein is very
well suited to highly multiplexed data, such as SNPs, differential
display, DNA arrays, and pooled samples. This is because the
inner product operations (and local band overlap corrections) can
be applied accurately and in constant time, regardless of the
number of data peaks in the signals.

It is useful to assess the quality of the individual
peak quantitation results. This assessment can be done by
comparing the modeled data function

$$\sum_{k=1}^{n} c_k \cdot \phi_k(x - x_k)$$

with the observed data signal f. A normalized deviation
(computed, e.g., via a correlation) between expected and observed
based on the minimization search function above can be used as the
comparison measure.

Referring to Figure 5, Step 10 is for analyzing the data, 
preferably by calling the alleles. 

In the preferred embodiment, allelic (or other) DNA 
ladder data is available, and the alleles can be called by 
matching sample peaks relative to the ladder peaks. This match 
operation is fast, reliable, very accurate, and accounts for 
inter-gel or inter-capillary variations. Depending on the 
application, a window (typically ±0.5 bp) is set around a ladder 
peak of calibrated DNA length. When a sample peak (preferably in 
ladder coordinates) lies within this bp size window, it can be 
reliably designated as having the length of that ladder peak. 
Zero size deviation between the centers of sample and ladder peaks 
indicates a perfect match; greater deviations indicate a less 
reliable match. 
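
The window match can be illustrated with a short sketch. The function name is hypothetical; the ±0.5 bp default follows the typical window stated above, and the ladder values in any use of the function are for illustration only.

```python
def match_to_ladder(peak_length, ladder_lengths, window=0.5):
    """Return (ladder length, deviation in bp) for the nearest ladder peak,
    or None when the sample peak lies outside every ladder window."""
    nearest = min(ladder_lengths, key=lambda bp: abs(peak_length - bp))
    deviation = peak_length - nearest
    if abs(deviation) <= window:
        return nearest, deviation
    return None
```

A deviation near zero corresponds to the near-perfect match described above, while a return value of None marks a peak that falls in no ladder window.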

In an alternative preferred embodiment, the data
profiles can be stored in size coordinates, and brought into
length coordinates only when needed. This is done by retrieving a
sample's size coordinate profile, as well as the ladder peaks.
The sample's peaks are found in size coordinates, and then
interpolated into length coordinates using the ladder size to
peak mapping, as described previously. The sample's peaks are
then in length coordinates, and can be rounded to the nearest
integer, or matched against integers representing valid allele
designations. The deviation from such integers (e.g., the
fractional component) can be used as a measure of quality, accuracy,
or concern.

In other embodiments, found in the current art's
fragment sizing software, the peak sizes (e.g., the alleles in
genotyping applications) are analyzed in a size standard
coordinate system, and ladder calibration data is not used by the
computer. This analysis entails a bottom-up collection and
comparison of data from many samples to form data-directed bins.
These bin distributions are then used to designate sample peak
size. However, the distribution variance across multiple
electrophoresis runs on different samples can be quite high.
When these bins have overlapping sizes, allele size designation
becomes quite uncertain.

In the preferred embodiment, both DNA length and amount
are used in scoring the data. With STR genotyping, for example,
where there can be multiple peaks, the DNA concentration (e.g.,
modeled peak height or area) is used as a measure of confidence in
the observed peak. With a heterozygote genotype, two peaks are
expected, and with a homozygote, only one. Since other peaks may
be present from mixtures, PCR artifacts, or other DNA sources, the
analysis will focus on the higher concentration peaks,
particularly those peaks residing in allelic ladder windows. Most
of the peak signal mass should be concentrated in the most
confident peaks: high DNA amount, and in a ladder window. When
this is not the case, confidence in the data is lower.






Once the peaks have been quantitated at known DNA
lengths, the data can be further analyzed. Such analyses include 
stutter deconvolution, relative amplification correction, allele 
calling, allele frequency determination (from pooled samples), 
differential display comparisons, and mutation detection. In 
genotyping applications, allele calling should be done on the 
signals only after corrections (e.g., for stutter or relative 
amplification) have been made. See (MW Perlin, "Method and system 
for genotyping," U.S. Patent #5,541,067, Jul. 30, 1996; MW Perlin, 
"Method and system for genotyping," U.S. Patent #5,580,728, Dec. 
3, 1996; S-K Ng, "Automating computational molecular genetics: 
solving the microsatellite genotyping problem," Carnegie Mellon
University, Doctoral dissertation CMU-CS-98-105, January 23, 
1998), incorporated by reference. 

In one preferred embodiment, allele calling on 
quantitated corrected data is done by: 

1. Finding the largest peak (area or height), and ensuring
that it is within a window on the allelic ladder.

2. Removing all peaks from the signal that either (a) have a
DNA length that is not in a window of the allelic ladder,
or (b) have a DNA amount that is not within some minimum
percentage of the largest peak.

3. Calling the alleles by matching the DNA lengths of each
sample peak to the DNA sizing windows on the allelic
ladder.

4. Applying rules to check for possible data artifacts.
Typical rules are described below.

5. Computing a quality score, particularly for those data
apparently free of data artifacts. Various quality score
components are discussed below.

6. Recording the designated alleles, and the quality of the
result.
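
Steps 1 through 3 above can be sketched as follows, for illustration only. The function name, the 0.3 minimum fraction, and the representation of peaks as (DNA length, DNA amount) pairs are assumptions made for this sketch and are not values given in the specification.

```python
def call_alleles(peaks, ladder_lengths, window=0.5, min_fraction=0.3):
    """peaks: list of (dna_length, dna_amount) pairs.

    Returns the sorted designated allele lengths, or None when the largest
    peak is not within a ladder window.
    """
    def ladder_match(length):
        nearest = min(ladder_lengths, key=lambda bp: abs(length - bp))
        return nearest if abs(length - nearest) <= window else None

    # Step 1: the largest peak must sit in a ladder window.
    top_length, top_amount = max(peaks, key=lambda p: p[1])
    if ladder_match(top_length) is None:
        return None
    # Step 2: drop off-ladder peaks and peaks that are too small
    # relative to the largest peak (hypothetical threshold).
    kept = [length for length, amount in peaks
            if ladder_match(length) is not None
            and amount >= min_fraction * top_amount]
    # Step 3: designate each remaining peak by its ladder length.
    return sorted({ladder_match(length) for length in kept})
```

Rule checking, quality scoring, and recording (steps 4 through 6) would then operate on the returned designations.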






Referring to Figure 6, the results of ladder processing, peak
quantitation and allele calling can be visualized. 

There are many "junk-detecting" rules that can be designed
and applied to data. Critical to all such rules is the ability to
compare observed measures against expected behaviors. By modeling
the steps of the processing, computing appropriate quality scores
at each step, and comparing these observed data features with
normative results, the invention enables a precise computer
diagnosis of problems with data signals and their quality. Some
example rules include:

• Noise only. Using measures (such as Wilcoxon's signed-rank
statistic) to test for randomness, a computer program can
decide that an experiment produced primarily noise (BW
Lindgren, Statistical Theory, Fourth Edition. New York, NY:
Chapman & Hall, 1993), incorporated by reference.

• Low signal. The peaks should have heights over a certain
(user-defined) minimum threshold. When a profile's highest
peak does not reach that threshold, flag the problem.

• High signal. The peaks should not be over a certain
(user-defined) maximum threshold. When a profile's highest peak
exceeds that threshold, flag the problem.

• Peak dispersion. The designated peaks should comprise a
certain percentage of the total signal. If a profile's
designated peaks contain less than that percentage, fire
the rule.

• Algorithm conflict. When multiple scoring algorithms are
used, they should agree on the scoring results. Report any
conflicts.

• Relative height. For some applications (e.g., forensic STR
analysis) the relative peak heights should be within a
predefined ratio of each other. Indicate when a genotype
has a second peak with a relative height that is too low.






• Third peak. One (homozygote) or two peaks (heterozygote)
should contain most of the DNA signal. Indicate when a
genotype has a third peak that contains too much DNA
signal.

• Off ladder. All the allele peaks should be close to their
ladder peaks. When one (or more) of the alleles are too
far away from their ladder peak, inform the user.

• Uncorrelated. When there are two allele peaks, their
centers should deviate similarly from their respective
ladder peaks. When a genotype has deviations that do not
correlate (i.e., their difference exceeds some threshold),
flag the problem.

• Control check. The calls computed for a control experiment
should be consistent with the known results. When they are
not, flag the result.

Note that some of these rules (e.g., "off-ladder") make use of
allelic ladders, when they are present.
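
A few of the rules above can be expressed compactly, as in the following sketch. The function name and all threshold values are hypothetical placeholders for the user-defined settings the rules refer to; only four of the listed rules are shown.

```python
def check_rules(peak_heights, called_indices, deviations,
                min_height=100.0, max_height=8000.0,
                min_called_fraction=0.7, max_deviation=0.5):
    """Return the names of the rules that fire for one genotype result.

    peak_heights   : heights of all peaks in the profile
    called_indices : positions in peak_heights of the designated peaks
    deviations     : bp deviation of each designated peak from its ladder peak
    """
    fired = []
    if max(peak_heights) < min_height:
        fired.append("low signal")
    if max(peak_heights) > max_height:
        fired.append("high signal")
    called = sum(peak_heights[i] for i in called_indices)
    if called / sum(peak_heights) < min_called_fraction:
        fired.append("peak dispersion")
    if any(abs(dev) > max_deviation for dev in deviations):
        fired.append("off ladder")
    return fired
```

An empty list indicates that none of the sketched rules fired; further rules can be appended in the same pattern.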

To better understand the decision support role of each
numerical quality score (such as those used in rules), and their
decision thresholds, it is useful to collect the scores for a
large set of data and analyze them. Collection proceeds by
recording the set of scores for each applicable genotype result,
and indexing the genotypes (say, by sample, gel, and locus).
Consider each such numerical score as a mapping from the genotypes
to numerical values. Classify the genotypes by how they were scored; one
preferred classification includes unscorable (e.g., noise), low
quality (e.g., rule firings), correctly called good data (e.g.,
human agreement), and incorrectly called good data (e.g., human
disagreement). The result classification can be obtained by
comparing the computer calls (and rule firings) with human edited
(or otherwise independently scored) data. On some useful subset
of genotypes (e.g., each locus examined separately), the numerical
quality scores can be collated and histogrammed for each result
classification; this produces a set of distributions, one for each 
classification. Comparing (numerically, visually, statistically, 
etc.) the different distribution profiles for the different
classifications provides insight into the utility and scaling of a 
numerical quality score, permitting the derivation of decision 
thresholds or probability models. In a preferred embodiment, a 
decision threshold is statistically determined by distinguishing 
two score distributions (e.g., correct and incorrect results) 
according to a determined sensitivity or specificity. In another 
preferred embodiment, linear or logistic regression is used to 
model the probability of an accurate allele call. These 
thresholds or probabilities can be displayed to a user for 
enhanced confidence or decision support. 
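
One simple way to derive such a decision threshold is sketched below. It assumes that higher scores indicate better quality, that a result is accepted when its score is at least the threshold, and that specificity means the fraction of incorrect results rejected; the function name is hypothetical.

```python
def choose_threshold(correct_scores, incorrect_scores, min_specificity=0.95):
    """Return (threshold, sensitivity, specificity) for the lowest observed
    score cut that rejects at least `min_specificity` of the incorrect results,
    or None if no observed score achieves it."""
    for t in sorted(set(correct_scores) | set(incorrect_scores)):
        specificity = sum(s < t for s in incorrect_scores) / len(incorrect_scores)
        if specificity >= min_specificity:
            sensitivity = sum(s >= t for s in correct_scores) / len(correct_scores)
            return t, sensitivity, specificity
    return None
```

Choosing the lowest qualifying cut keeps as many correct results as possible above the threshold for the required specificity.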

It is useful to compute a quality score on the good 
data; one criterion for "good data" is that the experiment does 
not trigger any "junk" detecting rules. Quality scores enable the 
ranking of experiment results for selective review, as well as the 
computation of accuracy probabilities. Many possible quality 
scores can be developed, depending on the application and the 
available data. In all cases, there is some kind of comparison 
between expected behavior and an observed result - small 
deviations indicate high quality, whereas large deviations suggest 
lower quality. Example quality scores include: 

• Domain measures. When ladder data is present, it is 
possible to compute the deviation between a candidate 
allele peak center and its associated ladder peak center. 
When ladder data is not present, a similar comparison can 
be made relative to a precalibrated bin center, rather than 
a ladder peak center. One useful score is the maximum
(over the scored alleles) of the absolute value of the
deviations. When this number is close to zero, the
confidence is high. As it increases to 0.5 bp, the result
becomes less confident.

• Range measures. A sizing data result pertains to a
particular number of peaks; any additional peaks represent
a dispersion of the signal mass away from the hypothesized
score. Ideally, all the signal mass should be present in
the called peaks, which can be measured by the peak
centering ratio = (called peaks) / (all peaks). When this
ratio is near unity, the confidence is high. As it
decreases to zero, the result becomes less confident.

• Product measures. A product of a domain measure with a
range measure can be a sensitive indicator of quality. In
one preferred embodiment, let a result scale with 1 as the
highest quality, and 0 as the lowest quality. Rescale the
domain measure above to map the score interval [0,0.5] to
the quality interval [1,0], with all scores above 0.5 set
to 0. Rescale the range measure above to map the score
interval [0,1] to the quality interval [1,0]. Then form
the product of these two rescaled scores, so that the
result lies in the interval [0,1].

• Function measures. Once all data corrections have been
made (e.g., for PCR stutter, relative amplification, peak
shapes, peak sizes, peak center deviations, band overlap,
etc.) on the fully quantitated and modeled signal, the
inverse of these corrections can be applied to the size
result to resynthesize the signal. Comparison of the
resynthesized signal with the data signal provides a
measure of how completely the analysis modeled the data -
the residual deviation can measure lack of confidence in
the result. One such comparison is a correlation between
the synthesized and data signals. This correlation can be
computed so that small values indicate confidence (e.g.,
using a normalized L2 deviation), or so that larger values
indicate confidence (e.g., using a normalized inner product
or statistical correlation measure).
The development of some quality scores has been described (MW
Perlin, "Method and system for genotyping," U.S. Patent
#5,876,933, Mar. 2, 1999; S-K Ng, "Automating computational
molecular genetics: solving the microsatellite genotyping
problem," Carnegie Mellon University, Doctoral dissertation
CMU-CS-98-105, January 23, 1998; B Palsson, F Palsson, M Perlin, H
Gubjartsson, K Stefansson, and J Gulcher, "Using quality measures
to facilitate allele calling in high-throughput genotyping,"
Genome Research, vol. 9, no. 10, pp. 1002-1012, 1999),
incorporated by reference.
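
The product measure can be illustrated with a brief sketch. The function names are hypothetical. One interpretive assumption is made: although the text maps the range score interval [0,1] onto [1,0], the sketch uses the peak centering ratio unchanged as the range quality, because a ratio near unity is described above as indicating high confidence.

```python
def domain_quality(max_abs_deviation_bp):
    """Map a deviation of 0..0.5 bp linearly onto quality 1..0; above 0.5 bp gives 0."""
    return max(0.0, 1.0 - max_abs_deviation_bp / 0.5)

def range_quality(called_signal, total_signal):
    """Peak centering ratio, assumed here to serve directly as the quality value."""
    return called_signal / total_signal

def product_quality(max_abs_deviation_bp, called_signal, total_signal):
    """Product of the two rescaled scores; lies in [0, 1]."""
    return (domain_quality(max_abs_deviation_bp)
            * range_quality(called_signal, total_signal))
```

Either factor approaching zero pulls the product toward zero, which is what makes the product a sensitive indicator.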

Probabilities can be computed from quality scores. The
individual components of the quality scores generally lie on a
numerical scale. Histogramming and multivariate regression
analysis can provide insight into the distribution of the
correctly scored "success" data population relative to the
distribution of the incorrectly scored "failure" data population
along each of these measures. A logit transformation of this
dichotomous outcome variable is useful, and provides [0,1] bounds
for probability estimates, and the use of a binomial (rather than
normal) distribution for the error analysis. By applying standard
logistic regression, the key underlying independent variables can
be elucidated. This logistic regression analysis can help
determine the thresholds used in the "junk-detection" rules, and,
for each experiment, can compute the probability of an accurate
score from the observed variables. For example, the domain and
range measures used above can be used as two independent
variables, with the outcome being success or failure in the
computer correctly calling the allele results. Logistic
regression on these variables with a preanalyzed data set can be
used to construct a correctness probability for allele calls on
further data sets. See (A Agresti, Categorical Data Analysis. New
York, NY: John Wiley & Sons, 1990; DW Hosmer and S Lemeshow,
Applied Logistic Regression. New York: John Wiley & Sons, 1989;
Statistical Analysis Software, SAS Institute, Cary, NC;
Statistical Package for the Social Sciences, SPSS Inc., Chicago,
IL), incorporated by reference.
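
For illustration, a minimal logistic regression on the domain and range measures might look like the sketch below. It uses plain gradient ascent on the log-likelihood rather than any of the statistical packages cited above; the function names, step count, and learning rate are assumptions, and any training data supplied to it would be the preanalyzed set described in the text.

```python
import math

def predict(w, x):
    """Probability of a correct call for predictor values x, given weights w."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, outcomes, steps=5000, rate=0.5):
    """Fit weights [bias, w1, w2, ...] by gradient ascent on the log-likelihood.

    rows     : predictor tuples, e.g. (domain measure, range measure)
    outcomes : 1 for a correct call, 0 for an incorrect call
    """
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(rows, outcomes):
            p = predict(w, x)
            grad[0] += y - p
            for j, xj in enumerate(x):
                grad[j + 1] += (y - p) * xj
        w = [wj + rate * gj / len(rows) for wj, gj in zip(w, grad)]
    return w
```

Once fitted, `predict` gives a correctness probability for each new allele call from its observed measures.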

Data Review 

Once the automated computer scoring has completed, it is
often useful to have a person assess the results. Since reviewing
perfectly scored data is an inessential step, the data review
should optimally focus on reviewing the least confident (but
scorable) data. The outcome of this focused review is typically a
decision for each experiment's result: accept the result, modify
the result, reject the result, plan to redo the experiment in the
laboratory, and so on.

In order to focus the data review, it is helpful to have
a prioritization. In the preferred embodiment, the quality score
or accuracy probability is used to rank the experiments. The
review may arrange the suspect experiments in different subsets,
e.g., grouped by marker, sample, equipment, personnel, time,
location, etc. By reviewing the worst data first (i.e., rule
firings, lowest quality scores), the user is enabled to focus on
evaluation and repair of only the suspect data. Not reviewing
highly confident data frees up human operator time for other
process tasks. Moreover, not reviewing unscorable data is
similarly useful.
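
A possible prioritization is sketched below, for illustration. The record fields ("scorable", "rules", "quality") and the 0.9 acceptance level are hypothetical; the sketch omits unscorable and highly confident results and orders the remainder worst first.

```python
def review_queue(results, accept_above=0.9):
    """Return the suspect results, rule firings first, then by ascending quality."""
    suspect = [r for r in results
               if r["scorable"] and (r["rules"] or r["quality"] < accept_above)]
    # `not r["rules"]` is False when rules fired, so those results sort first.
    return sorted(suspect, key=lambda r: (not r["rules"], r["quality"]))
```

The same queue can be regrouped by marker, sample, or run before presentation, as described above.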

Using these methods, low quality data can be identified
and classified. At the single datum level, individual results can
be examined and better understood. At the data set level,
problematic loci, samples or runs can be identified by examining
percentages of outcome indicators (e.g., rule firings, low quality
scores, calling errors) relative to data stratifications. For
example, the number of miscalls of known samples (arranged by
locus and gel run) can identify problematic data trends early on.
This information can provide useful quality control feedback to
the laboratory, which can help improve the overall quality of
future data generation.
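
The stratified tally can be sketched as follows. The function name and the record layout (locus, gel run, miscalled flag) are assumptions chosen to mirror the example in the text.

```python
from collections import defaultdict

def miscall_rates(records):
    """records: iterable of (locus, gel_run, miscalled) for known samples.

    Returns {(locus, gel_run): fraction of miscalled results}.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for locus, gel_run, miscalled in records:
        totals[(locus, gel_run)] += 1
        errors[(locus, gel_run)] += bool(miscalled)
    return {key: errors[key] / totals[key] for key in totals}
```

Strata with unusually high rates point to loci or runs that may merit laboratory attention.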

Referring to Figure 7, it is useful to present the
results graphically. Visualizations of the experiments (e.g.,
gel lane tracking or color separation) and the signal traces help
the user understand potential problems with the data, and
possibilities for their correction (both for the particular
experiment, as well as for the overall data generation process).
In the preferred embodiment, multiple graphical visualizations for
inspecting and reviewing genotype data include interfaces for:

• inspecting and annotating the raw gel data;

• inspecting and editing the automated lane tracking and sizing;

• assessing data quality and marker size windows;

• reviewing and editing automatically called alleles (by quality
score priority, or other user-selected data rankings),
preferably with all data and inferences visually presented in
the context of the marker's allelic ladder;

• visually examining the data signal, preferably overlaid upon
an allelic ladder;

• flexible examination of bleedthrough (or "pull-up") artifact;

• reviewing computed DNA quantitations and sizes, preferably in
the context of the allelic ladder, with a more detailed window
exploring the quantitation results;

• showing the allele calls;

• providing access to a more detailed textual window presenting
a summary of useful allele calling information in tabular
form;

• focusing on the allelic ladder data signals, when available;
and

• displaying selected multiple lane signals graphically in one
view.

Such multiple representations visually show the quality of the
data, and provide diverse, focused insights into the data and
their processing. Such graphical interfaces are generally used on
only a small fraction (e.g., less than 5%-10%) of the data, since
most high-throughput users do not care to revisit high-quality
allele calls on high-quality data.

Referring to Figure 8, it is also useful to present the
results textually. Textual information can facilitate a more
detailed understanding of a particular experiment. It is helpful
for the user to have rapid access to both graphical and text
presentation formats. The preferred embodiment provides
information on allele designations, ladder comparisons, molecular
weight sizes, and genotype quality. This display also gives
explanations of rules that may have fired. The table includes
peak size, peak height, peak area, and peak fit quality. The
display is extensible, permitting modifications to the display
that can draw from the expected and observed values that are
computed for each scored genotype.

The preferred embodiment provides very flexible
navigation between the graphical and textual view. A gel display
permits viewing of any combination of gel image, size standard
peak centers, or lane/size tracking grid, as well as editing
capability. An allele call display provides navigation facilities
(e.g., buttons and menus) for rapidly selecting samples and loci,
including zoom, slider, overlay, and relational options. With a
single click, the user can examine the peak and ladder sizes of
any designated allele. The interface also automates many of the
mundane display aspects (e.g., signal resizing, quality
prioritization), thereby enabling rapid user navigation.

For maximum flexibility, the preferred embodiment
reports results to computers and people in multiple ways. 
Preferred modalities include: 

• Providing a flexible format (e.g., tabbed text files, or SQL 
queries) for seamless interaction with database, spreadsheet, 
laboratory information management system (LIMS) , text editing, 
and other computer software. 

• Providing such a flexible format for input information.

• Providing such a flexible format for output results. 

• Providing such a flexible format for logging, audit, and error
messages.

• Recording rule firings and quality scores (along with allele 
calls) in result files (e.g., tabbed text) in such a flexible 
format. The rule firing representation is extensible and 
backwardly compatible, so that it is easy to add more rules 
over time as more cases are observed. 

• When a window is focused on a genotype, displaying the fired
rules (if any), thereby setting the context for why particular
low quality data were rejected.

• Listing or explaining the fired rules in plain English in a 
textual window for the human operator. 

• Listing the designations, allelic ladder information, 
molecular weight deviations (between the designation and its 
allelic ladder), quality score, or a table of peak sizes, 
areas, heights, and qualities. 

Different data artifacts are typically associated with
their own specific visualizations. For example, the size-designated
standard peaks of one signal may not overlay properly on the
size-designated standard peaks of a different signal; referring
to Figure 9, this improper overlay can be visualized by
superimposing the signals in different colors. For most data
artifacts, human data reviewers painstakingly (a) construct an
appropriate visual representation (e.g., an overlay of specific
signals) to (b) confirm or disconfirm the presence of the
artifact. The more efficient approach of the preferred embodiment
is to reverse this order. First (b), have the computer
automatically determine (from rules or other quality scores) what
data artifacts are present, and then (a) automatically construct
visualizations that are customized to each specific artifact.
These data-directed visualizations can be opened up automatically,
or displayed upon request.

Referring to Figures 4, 5, and 9, to efficiently develop
a library of such data artifact customized displays, the invention
includes a graphical language interpreter. An element of the
display library is a message template that operationally
characterizes how to display the artifact. This template is
filled in by applying it to specific data. A message for the
interpreter includes a set of n-tuples that describe the type of
display, the source of data, a size range, fluorescent dye, or
other useful display information. The data source preferably
refers to the n×k representation of data signals. This
interpreter then dispatches on the display type (e.g., data
signal, vertical line, fitted curve, annotation, etc.) and
possible subtype (e.g., main data, size ladder, allelic ladder,
known control, negative control, etc.) of each tuple to supply
additional display information (e.g., drawing color, line style,
line thickness, etc.) that is specific to that display type.
Execution of the set of messages constructs and presents the
customized display to the user for its corresponding data
artifact.
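
The template-and-dispatch idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the style table, the field names, the "$" placeholder convention, and the template itself are hypothetical and do not come from the specification, which leaves the message format open.

```python
# Hypothetical style table: (display type, subtype) -> drawing attributes.
STYLES = {
    ("data signal", "main data"): {"color": "blue", "line_style": "solid"},
    ("data signal", "allelic ladder"): {"color": "gray", "line_style": "solid"},
    ("vertical line", "allelic ladder"): {"color": "red", "line_style": "dashed"},
}
DEFAULT_STYLE = {"color": "black", "line_style": "solid"}

# Hypothetical message template for reviewing one allele call against its
# ladder. Strings that start with "$" are placeholders filled in from data.
ALLELE_CHECK_TEMPLATE = [
    {"display": "data signal", "subtype": "main data",
     "source": "$sample_lane", "size_range": "$size_range", "dye": "$dye"},
    {"display": "data signal", "subtype": "allelic ladder",
     "source": "$ladder_lane", "size_range": "$size_range", "dye": "$dye"},
    {"display": "vertical line", "subtype": "allelic ladder",
     "source": "$ladder_lane", "position": "$ladder_size"},
]


def fill_template(template, bindings):
    """Apply a template to specific data, producing one message."""
    message = []
    for entry in template:
        filled = {}
        for key, value in entry.items():
            if isinstance(value, str) and value.startswith("$"):
                value = bindings[value[1:]]
            filled[key] = value
        message.append(filled)
    return message


def interpret(message):
    """Dispatch on display type and subtype to add drawing attributes."""
    commands = []
    for entry in message:
        style = STYLES.get((entry["display"], entry.get("subtype")), DEFAULT_STYLE)
        commands.append({**entry, **style})
    return commands


message = fill_template(ALLELE_CHECK_TEMPLATE, {
    "sample_lane": 7, "ladder_lane": 1,
    "size_range": (180.0, 220.0), "dye": "FAM", "ladder_size": 201.3,
})
commands = interpret(message)
for command in commands:
    print(command)
```

A real interpreter would hand each command to a plotting layer; here the commands are only printed, so the sketch stays independent of any graphics library.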






Analysis System 

Referring to Figure 10, the present invention pertains
to a system for analyzing a nucleic acid sample 102, as specified
above. The system comprises a means 104 for forming labeled DNA
sample fragments 106 from a nucleic acid sample. The system
further comprises means 108 for size separating and detecting said
sample fragments to form a sample signal 110, said separating and
detecting means in communication with the sample fragments. The
system further comprises means 112 for forming labeled DNA ladder
fragments 114 corresponding to molecular lengths. The system
further comprises means 116 for size separating and detecting said
ladder fragments to form a ladder signal 118, said separating and
detecting means in communication with the ladder fragments. The
system further comprises means 120 for transforming the sample
signal into length coordinates 122 using the ladder signal, said
transforming means in communication with the signals. The system
further comprises means 124 for analyzing the nucleic acid sample
signal in length coordinates, said analyzing means in
communication with the transforming means.

Special Applications

For many applications, it is useful to generate sizing
results that are comparable across different DNA sequencer
instruments. This platform interoperability is essential, for
example, when creating a reference DNA database (e.g., for human
identification in forensics, or multi-laboratory genetic analyses)
with data comparisons from multiple laboratories. Should
different instruments at different laboratories yield different
sizing results for PCR products relative to the size standards,
then such DNA reference resources become almost useless. The
present invention uses sizing ladders based on such sample
fragments (e.g., allelic ladders, with or without known reference
samples) in order to assure that sizing results are based on
sample fragment sizes. Specifically, said sizing results are not
based solely on electrophoretic migration of DNA fragments
relative to size standard molecules; such relative migration can
be highly variable across different instruments, and even on the
same instrument when different run conditions are employed. The
present invention overcomes this limitation in the current art,
uses novel computer-based scoring of properly calibrated data to
provide automated sizing of DNA fragments, and enables true
interoperability between different sizing instruments.

Via this interoperability, the invention provides a
method for producing a nucleic acid analysis. Steps include: (a)
analyzing a first nucleic acid sample on a first size separation
instrument to form a first signal; (b) analyzing a second nucleic
acid sample on a second size separation instrument to form a
second signal; (c) comparing the first signal with the second
signal in a computing device with memory to form a comparison; and
(d) producing a nucleic acid analysis of the two samples from the
comparison that is independent of the size separation instruments
used.

In forensic applications, it is useful to match a size-based
genotype (e.g., STR or SNP) of a sample against a reference
genotype. In making such forensic match comparisons, it is
preferable to have a computer scoring program designate alleles
relative to an allelic ladder, rather than using an "exact" size
relative to sizing standards. This is because of the highly
variable differential migration of the sizing standards relative
to the PCR products. An allelic ladder (whether precharacterized
or dynamically characterized) provides standard reference DNA
molecule lengths. By comparing and reporting sample DNA fragment
lengths relative to these constant reference DNA lengths (and not
to variable size standard comigration units), it is possible to
reliably match genotypes of a given sample with those of a
reference sample. This comparison and reporting is enabled by the
present invention. Moreover, this reliability is essential for
human identification applications where near zero error is
required. In criminal comparisons, the sample DNA profile may be
from a stain at a crime scene, whereas the reference DNA profile
is from a suspect or a preexisting DNA database, with the goal of
establishing a connection between a suspect and a crime scene. In
civil applications, the sample and reference DNA profiles may be
used to determine degree of relatedness, as in a paternity case.
(IW Evett and BS Weir, Interpreting DNA Evidence: Statistical
Genetics for Forensic Scientists. Sunderland, MA: Sinauer Assoc,
1998), incorporated by reference.

PCR artifacts can complicate DNA fragment sizing. With
STRs, PCR stutter, relative amplification (also termed
"preferential amplification"), and constant bands are common
artifacts. Using computer-based scoring methods, these artifacts
can be resolved by stutter deconvolution, adjusting relative peak
heights based on fragment size, detecting and suppressing
artifactual bands, and other quantitative methods. (MW Perlin,
"Method and system for genotyping," U.S. Patent #5,541,067, Jul.
30, 1996; MW Perlin, "Method and system for genotyping," U.S.
Patent #5,580,728, Dec. 3, 1996; S-K Ng, "Automating computational
molecular genetics: solving the microsatellite genotyping
problem," Carnegie Mellon University, Doctoral dissertation CMU-
CS-98-105, January 23, 1998), incorporated by reference.

Differential display is a sensitive assay for relative
gene expression. In automated computer data analysis of
differential display gene expression profiles, the objective is to
identify size bins at which there is a demonstrable difference
between the DNA amounts present in the two profiles. In a
preferred embodiment, to compare:

• Transform the expression profiles so that each resides in a
uniform DNA sizing coordinate system, referring to Figure
1, Step 5. When the pixel representation is highly
reproducible (e.g., when running serially on a single
capillary), this transforming step can be omitted.

• Identify the peaks in each profile, and record their x
domain (DNA size) and y height (estimated DNA
concentration) values.

• Compare the domain values of the peaks in each profile, and
form a set of matching paired peaks (one from each
profile). Retain those paired peaks with close x values.
In the preferred embodiment, the difference between the x
values is less than or equal to 1/2 base pair.

• Retain the cross-profile peak pairs having relatively close
y value ratios. Compute the standard deviation of the
ratio of the y values for peak pairs, and remove those peak
pairs whose y-value ratios lie outside a certain standard
deviation range (e.g., one standard deviation).

• Rescale the profile ranges so that they approximately
superimpose. Normalize the first profile by dividing by
its maximum y value. Model the y-value ratios polynomially
(e.g., linearly) as a function of x. Normalize the second
profile by dividing by the modeled y-value ratio function.

• Refine the peak matching by a two dimensional comparison
that requires the matching peak pairs to satisfy both a
certain x-tolerance (e.g., 0.5 bp) and certain y tolerance
(e.g., 5% relative height). When the 1D peak matching does
not produce spurious results, this refining step may be
omitted.

• Use the matched peaks as a peak ladder, and transform the
coordinates of one profile into the coordinates of the
second profile, referring to Figure 1, Step 8. When the
peaks are very close (preferably less than 0.25 bp
deviation), this transforming step may be omitted.

• Compare the superimposed profiles, and identify peak pairs
whose calculated y-value differences or ratios are
outliers, relative to the paired y value standard
deviations. These outlying peak pairs represent possibly
significant up- or down-regulation of gene expression.

• Report the identified outlying peak pairs. This can be
done as two lists (one for up-regulation, and one for
down-regulation). Within each list, rank the results by the y
value deviation.

The results of such an analysis are shown in Figure 11, which
demonstrates no evident variation in the repeated running of one
differential display sample.
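
The core of the comparison steps listed above (pair peaks by size, keep the typically scaled pairs, flag the outliers, report two ranked lists) can be sketched as follows. This is a simplified illustration, not the specification's implementation: the function names, the three-standard-deviation outlier threshold, and the sample peak lists are assumptions, and the coordinate transformation, polynomial rescaling, and two-dimensional refinement steps are omitted.

```python
from statistics import mean, pstdev


def match_peaks(profile_a, profile_b, x_tol=0.5):
    """Pair each (size, height) peak in A with the nearest peak in B within x_tol bp."""
    pairs = []
    for xa, ya in profile_a:
        xb, yb = min(profile_b, key=lambda peak: abs(peak[0] - xa))
        if abs(xb - xa) <= x_tol:
            pairs.append(((xa, ya), (xb, yb)))
    return pairs


def find_regulated(profile_a, profile_b, x_tol=0.5, keep_sd=1.0, outlier_sd=3.0):
    """Return (up, down) lists of (size, ratio deviation), ranked by deviation."""
    pairs = match_peaks(profile_a, profile_b, x_tol)
    ratios = [yb / ya for (_, ya), (_, yb) in pairs]
    centre, spread = mean(ratios), pstdev(ratios)
    # Retain pairs whose height ratio lies within keep_sd standard deviations.
    core = [r for r in ratios if abs(r - centre) <= keep_sd * spread] or ratios
    scale, core_sd = mean(core), pstdev(core)
    up, down = [], []
    for (peak_a, _), ratio in zip(pairs, ratios):
        deviation = ratio - scale
        if abs(deviation) > outlier_sd * core_sd:
            (up if deviation > 0 else down).append((peak_a[0], deviation))
    up.sort(key=lambda item: -item[1])    # largest increase first
    down.sort(key=lambda item: item[1])   # largest decrease first
    return up, down


# Made-up peak lists: (DNA size in bp, peak height).
profile_a = [(100.0, 1000), (150.2, 2000), (200.1, 1500), (250.0, 800), (300.0, 1200)]
profile_b = [(100.1, 500), (150.0, 1000), (200.0, 750), (250.2, 1600), (300.1, 610)]
up, down = find_regulated(profile_a, profile_b)
print("up-regulated:", up)
print("down-regulated:", down)
```

In this made-up data most peaks in the second profile are about half the height of their partners, so only the 250 bp peak, whose height doubles instead, is reported as up-regulated.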

Software Description

Referring to Figure 12, the data scoring software is
preferably maintained in a version control system. After testing
has completed, program changes are committed. For each supported
platform (Macintosh, Windows, Unix, etc.), an automated assembly
computer retrieves the updated software and supporting data from
the version control server, preferably over a computer network.
Then, this computer compiles the software and data for run time
operation on the specific target platform, and follows automation
scripts to assemble the materials into an installer process,
preferably for CD-ROMs or internet installation (e.g.,
MindVision's VISE for the Macintosh, or InstallShield for
Windows). Hypertext documentation for the software is maintained,
updated, and compiled in a cross-platform format, such as
bookmarked PDF files, preferably using automated document
authoring software (e.g., Adobe's FrameMaker program). For each
supported platform, the application program, associated data, and
documentation are included on the web or disk installers. Users
preferably download or update their software using the web
internet installer; alternatively, a disk installer can be used
locally on a computer or local area network.

The automated scoring software maintains an audit trail
of its actions, operation, and decisions as it processes the data
according to the steps of Figures 1 and 5. The data formats are kept
operational for specific platforms by automatically checking and
transforming certain files (e.g., text representations) prior to
the program's using these files on that platform. The data format
permits multiple instrument data acquisition runs (e.g., gel or
capillary loadings) to be processed together in a single computer
analysis run. These software features reduce human intervention
and error.

User feedback on the program's operation is preferably
entered at a designated web site onto a reporting HTML form that
includes expected and observed program behavior. Via CGI scripts,
these reports are automatically logged onto a database (e.g.,
FileMaker Pro on a Macintosh over a local area network), and
appropriate personnel may be automatically notified via email of
potential problems.

Business Considerations

The automated quality maintenance system described
herein for generating and analyzing DNA fragment data has a
nonobvious business model. It is desirable for groups to generate
and analyze their own data, since the users of the data often have
the greatest incentive to maintain data quality. Moreover, people
involved in the data generation task require continual feedback
from the data analysis results. By generating data that includes
proper calibration reference standards (e.g., internal size
standards, allelic ladders, reference traces, etc.), high quality
data can be automatically analyzed in ways that lower quality data
cannot.

The cost of manual scoring of data is quite high. The
preferred business model includes a spreadsheet that permits an
end-user to calculate their labor costs. Referring to Figure 13,
a prospective or current customer can enter parameters related to
the cost of labor, the data generation throughput, and how
effectively the labor force analyzes the generated data. From
these factors, the spreadsheet can calculate the total labor cost,
as well as the labor cost per experiment, for the specific
customer requirements. This calculating tool is preferably made
available to customers in spreadsheet form (e.g., as a platform-
independent Microsoft Excel 98 spreadsheet) as a computer file
that can be distributed on disk, sent as an email attachment, or
downloaded from the internet. In a complementary embodiment,
the spreadsheet functionality can be provided as an interactive
form on a web page. By better understanding the labor costs of
data scoring, a customer can develop insight into the role of
quality data and automated computer-based data scoring as a direct
replacement for labor.
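
As a rough illustration of the kind of calculation such a spreadsheet performs, the sketch below derives a total and a per-experiment labor cost from three inputs. The parameter names and the example figures are hypothetical; the actual spreadsheet inputs are those shown in Figure 13, which is not reproduced here.

```python
def labor_cost(hourly_cost, experiments_per_year, experiments_scored_per_hour):
    """Return (total yearly labor cost, labor cost per experiment).

    All three inputs are illustrative stand-ins for the cost of labor, the
    data generation throughput, and the scoring efficiency of the labor force.
    """
    hours_needed = experiments_per_year / experiments_scored_per_hour
    total = hourly_cost * hours_needed
    per_experiment = total / experiments_per_year
    return total, per_experiment


# Example with made-up figures: 100,000 experiments per year, scored by hand
# at 50 per hour, at a labor cost of 30 currency units per hour.
total, per_experiment = labor_cost(30.0, 100_000, 50)
print(f"total: {total:.2f}, per experiment: {per_experiment:.2f}")
```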

Error is another significant cost in the human scoring 
of genetic data. In genetics, error severely compromises the 
power of linkage and other association measures, so that despite 
considerable research time, cost, and effort, genes are less 
likely to be discovered. In forensics, error can lead to 
incorrect identification of suspects, failed convictions of 
criminals, or failure in exonerating the innocent. Thus, methods 
that reduce error or improve data quality confer significant 
advantages to the user. 






Since automated computer-based scoring of DNA sizing
data is equivalent to human labor that performs the same task, the
pricing model of such automated scoring is preferably based on
usage. A fee is charged for every genotype scored. This fee
preferably includes components for intellectual property, software
support, and user customizations. With very high levels of usage,
some components such as user customization can be reduced in line
with the associated business expense reduction.

For better market penetration, pricing levels should be
set near or below the equivalent human labor cost. The result is
an automated computer-based scoring process that is faster, better,
and cheaper than the equivalent human review of data.
Specifically, the computer-based process can produce more
consistent results with lower error, leading to more productive
use of the scored data.

Additional services (including user training, system 
setup, software modifications, quality audits) are best charged 
separately. Preferably, there is a mandatory initial interaction 
with the customer group (for which the user pays) to train new 
users in the quality data generation and computer-based analysis
process.

Quality assurance is an integral part of the process and
the business model. A quality entity (e.g., a corporation) can
help support this quality maintenance system (QMS). This entity
can provide quality assurance by spot checking the user's data
generation or scoring of the DNA fragment size data. This quality
assurance can be done by (a) the entity providing the user with
samples (characterized by the entity) for data generation or
analysis, and then comparing the user's results with the entity's
results, (b) the user providing the entity with samples
(characterized by the user) for data generation or analysis, and
then having the entity compare the entity's results with the
user's results, (c) a comparison of data results involving a third
party, or (d) a double-blind comparison study.

The quality entity can use these quality assurance
methods to conduct quality audits for different sites, and
disseminate the results (e.g., by internet publication). Then,
different data generation sites can compete with one another for
business based on their quality and cost-effectiveness. Beyond a
certain critical mass, it would be highly desirable for such DNA
analysis sites to have certification provided by the quality
entity on the basis of such audits. Those sites with the highest
quality data and using the best automation software should be the
most competitive. The quality entity can also provide a service
(e.g., at regular time intervals, say annually) for quality checks
on sites that desire to achieve and maintain the best possible
data results.

This procedure describes a method for generating revenue
from computer scoring of genetic data. The steps include: (a)
supplying a software program that automatically scores genetic
data; (b) forming genetic data that can be scored by the software
program; (c) scoring the genetic data using the software program
to form a quantity of genetic data; and (d) generating a revenue
from computer scoring of genetic data that is related to the
quantity. Moreover, additional process steps include: (e)
defining a labor cost of scoring the quantity of genetic data when
not using the software program; (f) providing a calculating
mechanism for estimating the labor cost from the quantity; (g)
determining the labor cost based on the quantity; and (h)
establishing a price for using the software program that is
related to the labor cost.






Mixture Analysis 

In forensic science, DNA samples are often derived from
more than one individual. With the advent of quantitative
analysis of STR data, there is the possibility of computer-based
analysis that can resolve these data. Specifically, there is a
need to find or confirm the identity of component DNA profiles, as
well as determine mixture ratios. In the preferred embodiment,
the quantitation of the DNA samples is accomplished by performing
Steps 1-6 of Figure 1, and Steps 7-9 of Figure 5. The accurate
quantitation conducted in Step 9 enables accurate analysis of
quantitative mixture data, and improves on the prior art that uses
unmodeled peak area or peak height (GeneScan Software, PE
Biosystems, Foster City, CA) which provide potentially inaccurate
estimates of DNA concentration. The invention's quantitative
mixture analysis is part of Step 10 of Figure 5 in the case of DNA
mixtures.

The present invention is distinguished over the prior
art in that it uses a linear mathematical problem solving
formulation that combines information across loci, and can
completely integrate this information automatically on a computing
device. By inherently using all the information from all the loci
in the formulation, robust solutions can be achieved. The prior
art uses manual or Bayesian methods on a per locus basis that
entail human intervention in generating or combining partial
results. Such prior forensic mixture analysis methods have been
described (P Gill, R Sparkes, R Pinchin, TM Clayton, JP Whitaker,
and J Buckleton, "Interpreting simple STR mixtures using allele
peak area," Forensic Sci. Int., vol. 91, pp. 41-53, 1998; IW
Evett, P Gill, and JA Lambert, "Taking account of peak areas when
interpreting mixed DNA profiles," J. Forensic Sci., vol. 43, pp.
62-69, 1997; TM Clayton, JP Whitaker, R Sparkes, and P Gill,
"Analysis and interpretation of mixed forensic stains using DNA
STR profiling," Forensic Sci. Int., vol. 91, pp. 55-70, 1998),
incorporated by reference.

There are different formulations of the mixture problem.
With a mixture profile derived from two individuals, problems
include:

• verifying the mixture and computing the mixture ratio,
given the profiles of both individuals;

• determining the profile of one individual and the mixture
ratio, given the profile of another individual; and

• determining the profiles of both individuals and the
mixture ratio, given no other DNA profiles.

These problems can be similarly extended to a number of
individuals greater than two. The STR mixture method described
herein addresses all of these problem formulations.

In the PCR amplification of a mixture, the amount of
each PCR product scales in rough proportion to the relative weighting
of each component DNA template. This holds true whether the PCRs
are done separately, or combined in a multiplex reaction. Thus,
if two DNA samples A and B are in a PCR mixture with relative
concentrations weighted as wA and wB (0 < wA, wB < 1, wA + wB = 1),
their corresponding signal peaks after detection will generally
have peak quantitations (height or area) showing roughly the same
proportion. Therefore, by observing the relative peak
proportions, one can estimate the DNA mixture weighting. Note
that mixture weights and ratios are interchangeable, since the
mixture weight

$$\frac{[A]}{[A] + [B]}$$

is in one-to-one correspondence with the mixture ratio

$$\frac{[A]}{[B]}.$$

To mathematically represent the linear effect of the DNA
sample weights (wA, wB, wC, ...), write the linear equation:

$$p = G \times w,$$

where p is the pooled profile column vector, each column i of
genotype matrix G represents the alleles in the genotype of
individual i (taking allele values 0, 1, 2, with a total of 2
alleles), and w is the weight column vector that reflects the
relative proportions of template DNA or PCR product. To
illustrate this coupling of DNA mixture weights with predicted
pool weights, if there are three individuals A, B, C represented
in a mixture with weighting wA=0.5, wB=0.25, wC=0.25, and at one
locus the genotypes are:

A has allele 1 and allele 2,
B has allele 1 and allele 3, and
C has allele 2 and allele 3,

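then combining the vectors via the linear equations:

$$
\begin{bmatrix} \text{alleles} \\ \text{in} \\ \text{mixture} \end{bmatrix}
=
\begin{bmatrix}
\text{alleles} & \text{alleles} & \text{alleles} \\
\text{of} & \text{of} & \text{of} \\
A & B & C
\end{bmatrix}
\times
\begin{bmatrix} w_A \\ w_B \\ w_C \end{bmatrix}
$$

and representing each allele as a position in a column vector,
produces the linear relationship:

$$
\begin{bmatrix} 0.75 \\ 0.75 \\ 0.50 \end{bmatrix}
=
\begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix}
\times
\begin{bmatrix} 0.50 \\ 0.25 \\ 0.25 \end{bmatrix}
$$

Note that the sum of alleles in each allele column vector is
normalized to equal two, the number of alleles present.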
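
The single-locus example can be checked numerically. The short Python sketch below multiplies the genotype matrix by the weight vector; the helper name `mat_vec` is illustrative and not part of the specification.

```python
def mat_vec(G, w):
    """Multiply matrix G (a list of rows) by column vector w."""
    return [sum(g * wj for g, wj in zip(row, w)) for row in G]


# Rows are alleles 1-3; columns are individuals A, B, C.
G = [
    [1, 1, 0],  # allele 1: carried by A and B
    [1, 0, 1],  # allele 2: carried by A and C
    [0, 1, 1],  # allele 3: carried by B and C
]
w = [0.50, 0.25, 0.25]  # mixture weights wA, wB, wC

p = mat_vec(G, w)
print(p)       # [0.75, 0.75, 0.5]
print(sum(p))  # 2.0, the number of alleles per genotype
```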

With multiple loci, the weight vector w is identical
across all the loci, since that is the underlying mixture in the
DNA template. This coupling of loci can be represented in the
linear equations by extending the column vectors p and G with more
allele information for additional loci. To illustrate this
coupling of DNA mixture weights across loci, add a second locus to
the three individuals above, where at locus two the genotypes are:

A has allele 1 and allele 2,
B has allele 2 and allele 3, and
C has allele 3 and allele 4,
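then combining the vectors via the partitioning:

$$
\begin{bmatrix}
\text{locus 1 mixture alleles} \\
\text{locus 2 mixture alleles}
\end{bmatrix}
=
\begin{bmatrix}
\text{locus 1 A's alleles} & \text{locus 1 B's alleles} & \text{locus 1 C's alleles} \\
\text{locus 2 A's alleles} & \text{locus 2 B's alleles} & \text{locus 2 C's alleles}
\end{bmatrix}
\times
\begin{bmatrix} w_A \\ w_B \\ w_C \end{bmatrix}
$$

and representing each allele as a position in a column vector,
produces:

$$
\begin{bmatrix} 0.75 \\ 0.75 \\ 0.50 \\ 0.50 \\ 0.75 \\ 0.50 \\ 0.25 \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 0 \\
1 & 0 & 1 \\
0 & 1 & 1 \\
1 & 0 & 0 \\
1 & 1 & 0 \\
0 & 1 & 1 \\
0 & 0 & 1
\end{bmatrix}
\times
\begin{bmatrix} 0.50 \\ 0.25 \\ 0.25 \end{bmatrix}
$$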

With multiple loci, there is more data and greater confidence in 
estimates computed from the linear equations. 
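
One way to build the stacked two-locus system from the stated genotypes is sketched below. The helper `genotype_rows` and its one-based allele numbering are illustrative assumptions; only the genotypes and weights come from the example above.

```python
def genotype_rows(genotypes, n_alleles):
    """Build the rows of G for one locus.

    genotypes holds one tuple of allele numbers (1-based) per individual;
    a homozygote such as (2, 2) contributes the value 2 at that allele.
    """
    rows = [[0] * len(genotypes) for _ in range(n_alleles)]
    for individual, alleles in enumerate(genotypes):
        for allele in alleles:
            rows[allele - 1][individual] += 1
    return rows


# Individuals A, B, C at each locus.
locus1 = genotype_rows([(1, 2), (1, 3), (2, 3)], n_alleles=3)
locus2 = genotype_rows([(1, 2), (2, 3), (3, 4)], n_alleles=4)

G = locus1 + locus2      # stack the loci: 7 allele rows, 3 individual columns
w = [0.50, 0.25, 0.25]   # the same weights apply at every locus

p = [sum(g * wj for g, wj in zip(row, w)) for row in G]
print(p)  # [0.75, 0.75, 0.5, 0.5, 0.75, 0.5, 0.25]
```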

More precisely, write the vector/matrix equation p = G × w
for mixture coupling (of individuals and loci) as coupled linear
equations that include the relevant data:

$$p_{ik} = \sum_{j} g_{ijk} \, w_j$$

where for locus i, individual j, and allele k

• p_ik is the pooled proportion in the observed mixture of
locus i, allele k;

• g_ijk is the genotype of individual j at locus i in allele k,
taking values 0 (no contribution), 1 (heterozygote or
hemizygote contribution), or 2 (homozygote
contribution), though with anomalous chromosomes other
integer values are possible; and

• w_j is the weighting in the mixture of individual j.

Given partial information about the equation p = G × w, other
elements can be computed by solving the equation. Cases include:

• When G and w are both known, then the data profile p can be
predicted. This is useful in search algorithms.

• When G and p are both known, then the weights w can be
computed. This is useful in confirming a suspected
mixture, and in search algorithms.

• When p is known, inferences can be made about G and w,
depending on the prior information available (such as
partial knowledge of G). This is useful in human
identification applications.

The DNA mixture is resolved in different ways, depending on the
case.

Assume throughout that the mixture profile data vector p
has been normalized for each locus. That is, for each locus, let
NumAlleles be the number of alleles found in data for that locus
(typically NumAlleles = 2, one for each chromosome). For each
allele element of the quantitation data within the locus, multiply
by NumAlleles, and divide by the sum (over the alleles) of all the
quantitation values for that locus. Then, the sum of the
normalized quantitation data is NumAlleles, as in the illustrative
example above.
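
The per-locus normalization is a one-line rescaling. The sketch below shows it in Python; the function name and the raw peak values are made up for illustration.

```python
def normalize_locus(quantitations, num_alleles=2):
    """Rescale one locus's peak quantitations so that they sum to num_alleles."""
    total = sum(quantitations)
    return [num_alleles * q / total for q in quantitations]


# Made-up peak areas for the three alleles observed at one locus.
raw = [1500, 1500, 1000]
normalized = normalize_locus(raw)
print(normalized)       # [0.75, 0.75, 0.5]
print(sum(normalized))  # 2.0
```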

To resolve DNA mixtures, perform the steps: (a) obtain 
DNA profile data that include a mixed sample; (b) represent the 
data in a linear equation; (c) derive a solution from the linear 
equation; and (d) resolve the DNA mixture from the solution. This 
procedure is illustrated in the following cases. 






First consider the case where all the genotypes G and
the mixture data p are known, and the mixture weights w need to be
determined. This problem is resolved by solving the linear
equations p = G × w for w using SVD or some other matrix inversion
method. Specifically, w can be estimated as:

w = G\p

using the SVD matrix division operation in MATLAB.
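
For readers without MATLAB, the same least-squares estimate can be sketched in plain Python. Note the substitution: this sketch solves the normal equations (GᵀG)w = Gᵀp by Gaussian elimination rather than using an SVD, and it assumes G has full column rank. The function name is illustrative.

```python
def solve_weights(G, p):
    """Least-squares estimate of w in p = G x w via the normal equations."""
    n = len(G[0])
    # Form the augmented system [G^T G | G^T p].
    M = [
        [float(sum(row[i] * row[j] for row in G)) for j in range(n)]
        + [float(sum(row[i] * pk for row, pk in zip(G, p)))]
        for i in range(n)
    ]
    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("G does not have full column rank")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    w = [0.0] * n
    for i in reversed(range(n)):
        known = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - known) / M[i][i]
    return w


# Two-locus example from the text: rows are alleles, columns are A, B, C.
G = [[1, 1, 0], [1, 0, 1], [0, 1, 1],
     [1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
p = [0.75, 0.75, 0.50, 0.50, 0.75, 0.50, 0.25]
w = solve_weights(G, p)
print([round(x, 6) for x in w])
```

With noise-free data the estimate recovers the weights 0.5, 0.25, and 0.25 up to rounding; with real peak data the result is the least-squares compromise across all loci.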



Next consider the case of two individuals A and B where
one of the two genotypes (say, A) is known, the mixture weights w
are known, and the quantitative mixture data profile p is
available. Expand p = G × w in this case as:

p = wA·gA + wB·gB

where gA and gB are the genotype column vectors of individuals A
and B, and wA and wB = (1-wA) are their mixture weights. Then, to
resolve the genotype, rewrite this equation as

gB = (p - wA·gA) / wB

and solve for gB by matrix operations. The computed gB is the
normalized difference of the mixture profile minus A's genotype.
The accuracy of the solution increases with the number of loci
used, and the quality of the quantitative data.
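
A minimal numerical illustration of this subtraction follows, using made-up single-locus data with wA = 0.25 (chosen so the arithmetic is exact in floating point). The function name is illustrative.

```python
def subtract_known(p, gA, wA):
    """Compute gB = (p - wA*gA) / wB with wB = 1 - wA."""
    wB = 1.0 - wA
    return [(pk - wA * ak) / wB for pk, ak in zip(p, gA)]


gA = [1, 1, 0]          # known genotype: alleles 1 and 2
p = [0.25, 1.00, 0.75]  # normalized mixture profile (made-up data)
gB = subtract_known(p, gA, wA=0.25)
print(gB)  # [0.0, 1.0, 1.0], i.e. B carries alleles 2 and 3
```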

Next consider the case of making inferences about the
genotype matrix G starting from a mixture profile p. This case
has utility in forensic science. In one typical scenario, a stain
from a crime scene may contain a DNA mixture from the victim and
an unknown individual, the victim's DNA is available, and the
investigator would like to connect the unknown individual's DNA
profile with a candidate perpetrator. This scenario typically
occurs in rape cases. The perpetrator may be a specific suspect,
or the investigator may wish to check the unknown individual's DNA
profile against a DNA database of possible candidates. If the
mixture weight wA were known, then the genotype gB could be
computed immediately from the matrix difference operation of the
preceding paragraph.

Since wA is not known, a workable approach is to search
for the best w in the [0,1] interval that satisfies additional
constraints on the problem, set wA equal to this best w, compute
the genotype g(wA) as a function of this optimized wA value, and
set gB = g(wA). A suitable constraint is the prior knowledge of
the form that possible solution genotype vectors g can take. It
is known that solutions must have a valid genotype subvector at
each locus (e.g., having alleles taking on values 0, 1 or 2, and
summing to 2). This knowledge can be translated into a heuristic
function of g(w) which evaluates each candidate genotype solution
g against this criterion.

In the preferred embodiment, the heuristic is a function 
of w, the profile p, and the known genotype gA. Since p and gA 
are fixed for any given problem, in this case the function depends 
only on the variable w. For any given w in (0,1), compute g(w) as 
(p - w-gA)/(l-w). Then, at each locus, compute and record the 

deviation dev,^^^ (g (w) ) . The dev^^^^ function at one locus is 
defined as: 

• Assxjme the genotype comprises one allele. Compute the 

deviation by finding the index of the largest peak, and 
forming a vector oneallele that has the value 2 at this 
index and is 0 elsewhere. Let devl be the L2 norm of the 
difference between g(w) and oneallele. 

• Assume the genotype comprises two alleles. Compute the 

deviation by finding the index of the two largest peaks, 
and forming a vector twoallele that has the value 1 at 
each of these two indices and is 0 elsewhere. Let dev2 



be the norm of the difference between g(w) and 
twoallele . 

• Return the the lesser of the two deviations as 
minimum (devl, dev2) . 
To compute dev(g(w) ) , sum the component dev^^^^^ (g (w) ) at each 
locus. That is, the heuristic function is the scalar value 

loci 

Appropriately optimize (e.g., minimize) this function over w in 
[0,1] to find wB, and determine gB as g(wB) . In one alternative 
preferred embodiment, the summation terms can be normalized to 
reflect alternative preferred weightings of the loci or alleles. 
In a different alternative preferred embodiment, various heuristic 
functions can be used that reflect other reasonable constraints on 
the genotype vectors, as in (P Gill, R Sparkes, R Pinchin, TM 
Clayton, JP Whitaker, and J Buckleton, "Interpreting simple STR 
mixtures using allele peak area," Forensic Sci . Int., vol. 91, pp. 
41-53, 1998), incorporated by reference. 
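
The heuristic search described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the embodiment: the function names (`dev_locus`, `dev`, `estimate_weight`), the use of NumPy, and the simple grid search with a step of 0.001 are all assumptions made here, and any other one-dimensional optimizer over w could be substituted.

```python
import numpy as np

def dev_locus(g):
    """Deviation of one locus vector g from the nearest one- or two-allele genotype."""
    g = np.asarray(g, dtype=float)
    order = np.argsort(g)[::-1]            # allele indices, largest peak first
    one_allele = np.zeros_like(g)
    one_allele[order[0]] = 2.0             # value 2 at the largest peak
    dev1 = np.linalg.norm(g - one_allele)
    if g.size < 2:                         # a single-allele locus has no two-allele case
        return float(dev1)
    two_allele = np.zeros_like(g)
    two_allele[order[:2]] = 1.0            # value 1 at the two largest peaks
    dev2 = np.linalg.norm(g - two_allele)
    return float(min(dev1, dev2))

def dev(w, p_loci, gA_loci):
    """Sum of the per-locus deviations of g(w) = (p - w*gA) / (1 - w)."""
    total = 0.0
    for p, gA in zip(p_loci, gA_loci):
        g = (np.asarray(p, float) - w * np.asarray(gA, float)) / (1.0 - w)
        total += dev_locus(g)
    return total

def estimate_weight(p_loci, gA_loci, grid=None):
    """Grid search over w (an assumed search strategy) for the minimum of dev."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 981)   # assumed step of 0.001
    devs = [dev(w, p_loci, gA_loci) for w in grid]
    i = int(np.argmin(devs))
    return float(grid[i]), float(devs[i])
```

Given a list of normalized per-locus mixture vectors and the matching known genotype vectors, `estimate_weight` returns an estimate of wB and the deviation dev(gB); gB itself is then g(wB) at each locus.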

Referring to Step 10 of Figure 5, further quality 
assessment can be performed on the computed STR profile derived 
from the mixture analysis. As described, rule checking can 
identify potentially anomalous allele calls, particularly when 
peak quantities or sizes do not conform to expectations. Quality 
measures can be computed on the genotypes, which can indicate 
problematic calls even when no rule has fired. One preferred 
quality score in mixture analysis is the deviation dev(gB) of the 
computed genotype. Low deviations indicate a good result, whereas 
high scores suggest a poor result. These deviations are best 
interpreted when scaled relative to a set of calibration data 
which have been classified as correct or incorrect. Of particular 
utility is the partitioning of the deviations by locus, using the 
locus deviation function dev_locus(gB). When a locus has an 
unusually high deviation, it can be dropped from the profile, and 
the resulting partial profile can be used for human identity 
matching. 
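
As a rough illustration of partitioning the deviations by locus and dropping a poorly fitting locus, consider the following Python sketch. The locus names, the profile values, and the threshold of 0.25 are hypothetical and chosen only for the example; in practice the cutoff would come from calibration data as described above.

```python
import numpy as np

def locus_deviations(profile, genotype):
    """L2 deviation of the computed profile from the called genotype, per locus."""
    return {
        locus: float(np.linalg.norm(np.asarray(profile[locus], float)
                                    - np.asarray(genotype[locus], float)))
        for locus in profile
    }

def partial_profile(genotype, devs, threshold):
    """Keep loci whose deviation is at or below the threshold; list the rest."""
    kept = {locus: g for locus, g in genotype.items() if devs[locus] <= threshold}
    dropped = sorted(locus for locus in genotype if devs[locus] > threshold)
    return kept, dropped

# Hypothetical values, for illustration only.
profile = {"LOCUS_1": [1.02, 0.98, 0.00], "LOCUS_2": [0.55, 0.60, 0.85]}
genotype = {"LOCUS_1": [1, 1, 0], "LOCUS_2": [0, 1, 1]}

devs = locus_deviations(profile, genotype)
kept, dropped = partial_profile(genotype, devs, threshold=0.25)
print(devs, dropped)
```

Here the second (hypothetical) locus fits neither allele pattern well, so it is dropped and the remaining loci form the partial profile.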

The most preferred embodiment is demonstrated here on 
data generated using the 10 STR locus SGMplus panel (PE 
BioSystems, Foster City, CA), and size separated and detected on 
an ABI/310 genetic analyzer sequencing instrument. A mixture 
proportion of 30% sample A and 70% sample B was used. Referring 
to Figure 14, a minimization search for weight w by computing 
dev(g(w)) gave a weighting of 29.73%, which is very close to the 
true 30%. The deviation dev(g(w)) of the computed genotype from 
the closest (and correct) feasible solution was 0.3199. The 
computed genotype for all the loci is shown in the columns PROFILE 
(exact) and GENO B (rounded) in the table below. 

Data and results are shown in the table below, where 
MIXTURE is the normalized peak quantitation data from the mixed 
sample, GENO A is the known genotype of individual A, PROFILE is 
the numerical genotype computed for determining B's genotype, 
GENO B is the resulting integer genotype (and, in this case, the 
known genotype as well), and DEVS are the square deviations of the 
PROFILE from GENO B. Quality assessment of the computed PROFILE 
shows uniform peaks that are consistent with a correct genotype. 
Examination of the square deviation components for each allele 
reveals no significant problem, and the largest within locus sum 
of squares deviation was the small value 0.16 (for locus D21S11). 



LOCUS-ALLELE     MIXTURE   GENO A   PROFILE   GENO B   DEVS 

D3S1358-14       1.0365    1.0000    1.0520   1.0000   0.0027 
D3S1358-15       0.9635    1.0000    0.9480   1.0000   0.0027 
vWA-17           1.4755    0         2.0999   2.0000   0.0100 
vWA-18           0.5245    2.0000   -0.0999   0        0.0100 
D16S539-11       1.4452    0         2.0567   2.0000   0.0032 
D16S539-13       0.2889    1.0000   -0.0120   0        0.0001 
D16S539-14       0.2660    1.0000   -0.0447   0        0.0020 
D2S1338-16       0.3190    1.0000    0.0308   0        0.0010 
D2S1338-18       0.6339    0         0.9021   1.0000   0.0096 
D2S1338-20       0.3713    1.0000    0.1052   0        0.0111 
D2S1338-21       0.6758    0         0.9618   1.0000   0.0015 
D8S1179-9        0.7279    0         1.0359   1.0000   0.0013 
D8S1179-12       0.2749    1.0000   -0.0320   0        0.0010 
D8S1179-13       0.6813    0         0.9696   1.0000   0.0009 
D8S1179-14       0.3160    1.0000    0.0265   0        0.0007 
D21S11-27        0.2787    1.0000   -0.0265   0        0.0007 
D21S11-29        0.7876    0         1.1208   1.0000   0.0146 
D21S11-30        0.9337    1.0000    0.9057   1.0000   0.0089 
D18S51-12        0.3443    1.0000    0.0669   0        0.0045 
D18S51-13        0.6952    0         0.9894   1.0000   0.0001 
D18S51-14        0.6755    0         0.9613   1.0000   0.0015 
D18S51-17        0.2850    1.0000   -0.0176   0        0.0003 
D19S433-12.2     0.6991    0         0.9949   1.0000   0.0000 
D19S433-14       0.6060    2.0000    0.0161   0        0.0003 
D19S433-15       0.6949    0         0.9890   1.0000   0.0001 
TH01-6           0.3178    1.0000    0.0291   0        0.0008 
TH01-7           1.0074    1.0000    1.0105   1.0000   0.0001 
TH01-9           0.6749    0         0.9605   1.0000   0.0016 
FGA-19           1.0580    1.0000    1.0826   1.0000   0.0068 
FGA-24           0.2830    1.0000   -0.0203   0        0.0004 
FGA-25.2         0.6589    0         0.9378   1.0000   0.0039 
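
The PROFILE and DEVS columns can be checked by hand from the MIXTURE and GENO A columns using g(w) = (p - w·gA)/(1-w). The short Python sketch below does this for two loci, with w = 0.2973 taken from the text; the variable and function names are assumptions made for the illustration. Because w is quoted to only four digits, the recomputed values should be expected to agree with the table to roughly three decimal places rather than exactly.

```python
import numpy as np

w = 0.2973  # weight of sample A reported in the text (29.73%)

# (MIXTURE, GENO A, GENO B) values copied from the table for two loci.
loci = {
    "vWA":    ([1.4755, 0.5245],         [0, 2],    [2, 0]),
    "D21S11": ([0.2787, 0.7876, 0.9337], [1, 0, 1], [0, 1, 1]),
}

def profile(p, gA, w):
    """Numerical genotype of B: g(w) = (p - w*gA) / (1 - w)."""
    return (np.asarray(p, float) - w * np.asarray(gA, float)) / (1.0 - w)

for name, (p, gA, gB) in loci.items():
    g = profile(p, gA, w)
    sq_devs = (g - np.asarray(gB, float)) ** 2      # DEVS column
    locus_dev = float(np.sqrt(sq_devs.sum()))       # L2 deviation at this locus
    print(name, np.round(g, 4), np.round(sq_devs, 4), round(locus_dev, 2))
```

For D21S11 the square deviations sum to about 0.024, whose square root is about 0.16, which appears to be the within-locus value quoted in the text.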



In an alternative preferred embodiment, the heuristic 
function can be extended to account for the curvature of 
deviations (as a function of w), so that only local minima are 
considered and not boundary points. Genotype elimination 
constraints can be applied when there is extra knowledge, such 
as when the mixture weight (hence proportion of sample B) is low and 
sample A's genotype can be excluded in certain cases. It is also 
possible to provide for interactive human feedback, so that expert 
assessments can work together with the computing method. 
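
One simple way to restrict the search to local minima rather than boundary points is sketched below in Python. This is an assumed reading of the curvature idea, not a method given in the text: the deviation curve is sampled on a grid of w values, and only interior samples that are strictly lower than both neighbors are kept as candidates. The function names are hypothetical.

```python
import numpy as np

def interior_local_minima(ws, devs):
    """Return (w, dev) pairs at interior grid points lower than both neighbors."""
    devs = np.asarray(devs, dtype=float)
    return [
        (float(ws[i]), float(devs[i]))
        for i in range(1, len(devs) - 1)
        if devs[i] < devs[i - 1] and devs[i] < devs[i + 1]
    ]

def best_interior_minimum(ws, devs):
    """Lowest interior local minimum, or None if the curve has none."""
    minima = interior_local_minima(ws, devs)
    return min(minima, key=lambda pair: pair[1]) if minima else None
```

With this filter, a curve that happens to dip at the edge of the search interval no longer wins over a genuine interior minimum.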

When stutter peaks are a concern, use stutter 
deconvolution to mathematically remove the stutter artifact from 
the quantitative signal (MW Perlin, G Lancia, and S-K Ng, "Toward 
fully automated genotyping: genotyping microsatellite markers by 
deconvolution," Am. J. Hum. Genet., vol. 57, no. 5, pp. 1199-1210, 
1995), incorporated by reference. The prior forensic art uses 
Bayesian approaches to account for stutter (P Gill, R Sparkes, and 
JS Buckleton, "Interpretation of simple mixtures when artefacts 
such as stutters are present - with special reference to multiplex 
STRs used by the Forensic Science Service," Forensic Sci. Int., 
vol. 95, pp. 213-224, 1998), incorporated by reference. However, 
direct stutter removal from the signal can be highly robust, since 
it is working directly at the level of the stutter artifact, prior 
to any mixture computation. 

In the case when both genotypes are unknown, use 
additional search. Start from the linear equations p = G×w. As 
necessary, form feasible submatrices H(locus) of G, where each H 
is a K×2 matrix, representing K alleles (rows) for 2 individuals 
(columns). Here, H = {g_ijk}, where locus i is fixed, individual j 
∈ {1,2}, and allele k ∈ {1, 2, ..., K}. For example, the six 
feasible four allele (K=4) genotype pairs are represented by the 
six genotype matrices {H_i | i = 1, 2, ..., 6}: 

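\[
H =
\begin{bmatrix}
g_{i11} & g_{i21} \\
g_{i12} & g_{i22} \\
g_{i13} & g_{i23} \\
g_{i14} & g_{i24}
\end{bmatrix}
\in
\left\{
\begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix},
\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix},
\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix},
\begin{bmatrix} 0 & 1 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix},
\begin{bmatrix} 0 & 1 \\ 1 & 0 \\ 0 & 1 \\ 1 & 0 \end{bmatrix},
\begin{bmatrix} 0 & 1 \\ 0 & 1 \\ 1 & 0 \\ 1 & 0 \end{bmatrix}
\right\}
\]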
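
The six four-allele matrices can be generated mechanically: when four distinct alleles are observed at a locus and two individuals contribute, each individual must carry exactly two of them, so choosing the two alleles of the first individual fixes the second. The Python sketch below enumerates these choices; the function name and the use of `itertools.combinations` are assumptions made for the illustration.

```python
from itertools import combinations
import numpy as np

def four_allele_pairs():
    """Enumerate the 4x2 genotype matrices for a locus showing four distinct alleles."""
    matrices = []
    for first in combinations(range(4), 2):   # alleles carried by individual 1
        H = np.zeros((4, 2), dtype=int)
        H[:, 1] = 1                           # start with all alleles on individual 2
        H[list(first), 0] = 1                 # move the chosen two to individual 1
        H[list(first), 1] = 0
        matrices.append(H)
    return matrices

for H in four_allele_pairs():
    print(H.tolist())
```

Each matrix has column sums of 2 (two alleles per individual) and row sums of 1 (each allele belongs to exactly one individual).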



Matrix division (e.g., SVD) is performed using these H 
submatrices. The mixture profile data at each individual locus 
p(locus) is also used. Proceed as follows: 

1. Normalize the mixture data p to sum to 2 at each locus. 

2. Find the best two element weighting vector w. On the 
   subset of most informative loci (typically those loci 
   having four alleles): 
     For each locus, 
       For every valid genotype submatrix H at that locus, 
         compute w(locus,H) = p(locus)\H 
         normalize w(locus,H) to sum to 1 
         set dev(locus,H) to norm2{p(locus) - H×w(locus,H)'} 
   Set w = w(locus,H) having the smallest dev(locus,H). 

3. Derive the genotype H of each locus as follows: 
     For every valid genotype matrix H_i at that locus, 
       compute dev(H_i) = norm{p(locus) - H_i×w'} 
     Set H(locus) to that H_i having the smallest dev(H_i). 

4. Assess the quality of the genotype result G formed by 
   combining the {H(locus)} at each locus. This is 
   preferably done by: 

   (a) using the matrix operation w = G\p to estimate a 
   second mixture weight w, and comparing with the first w 
   found in Step 2, or 

   (b) by examining the computed deviations dev(H(locus)) 
   found in Step 3. 

Note that the operation "w(locus,H) = p(locus)\H" in this 
procedure for computing the best weight uses a matrix operation 
(i.e., matrix division). This procedure is preferable to finding 
w by minimizing norm2{p(locus) - H×w(locus,H)'} over w using brute 
force computation. 
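
The four steps above can be sketched in Python as follows. This is a minimal sketch under several assumptions, not the implementation of the embodiment: the backslash operation is read as a least-squares solve and mapped to `numpy.linalg.lstsq`; the "valid genotype matrices" at a locus are taken to be all pairs of genotype vectors (entries 0, 1 or 2, summing to 2) in which every observed allele is carried by at least one individual; and at least one four-allele locus is assumed to be present. The function names are hypothetical. Note that the two individuals are interchangeable, so the result is determined only up to swapping the columns.

```python
from itertools import combinations_with_replacement, product
import numpy as np

def genotype_matrices(K):
    """All Kx2 matrices whose columns are genotypes and whose rows are all non-zero."""
    vecs = []
    for a, b in combinations_with_replacement(range(K), 2):
        g = np.zeros(K)
        g[a] += 1.0
        g[b] += 1.0
        vecs.append(g)
    return [np.column_stack([g1, g2])
            for g1, g2 in product(vecs, repeat=2) if np.all(g1 + g2 > 0)]

def resolve_two_unknowns(p_loci):
    # Step 1: normalize each locus to sum to 2.
    p_loci = [2.0 * np.asarray(p, float) / np.sum(p) for p in p_loci]

    # Step 2: best weight over the four-allele loci, using w = p\H (least squares).
    best_dev, w = np.inf, None
    for p in p_loci:
        if p.size != 4:
            continue
        for H in genotype_matrices(4):
            cand = np.linalg.lstsq(p[:, None], H, rcond=None)[0].ravel()
            cand = cand / cand.sum()
            d = np.linalg.norm(p - H @ cand)
            if d < best_dev:
                best_dev, w = d, cand
    if w is None:
        raise ValueError("this sketch needs at least one four-allele locus")

    # Step 3: best genotype matrix at every locus for the fixed weight w.
    Hs, devs = [], []
    for p in p_loci:
        cands = genotype_matrices(p.size)
        ds = [np.linalg.norm(p - H @ w) for H in cands]
        i = int(np.argmin(ds))
        Hs.append(cands[i])
        devs.append(float(ds[i]))

    # Step 4(a): second weight estimate from the combined genotype, w = G\p.
    G = np.vstack(Hs)
    w_check = np.linalg.lstsq(G, np.concatenate(p_loci), rcond=None)[0]
    return w, Hs, devs, w_check
```

Comparing `w_check` with `w`, and inspecting `devs` locus by locus, corresponds to quality checks (a) and (b) of Step 4.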



Herein, means or mechanism for language has been used. 
The presence of means is pursuant to 35 U.S.C. §112 paragraph and 
is subject thereto. The presence of mechanism is outside of 35 
U.S.C. §112 and is not subject thereto. 

Although the invention has been described in detail in 
the foregoing embodiments for the purpose of illustration, it is 
to be understood that such detail is solely for that purpose and 
that variations can be made therein by those skilled in the art 
without departing from the spirit and scope of the invention 
except as it may be described by the following claims. 



WHAT IS CLAIMED IS: 

1. A method for analyzing a nucleic acid sample 
comprised of the steps: 

(a) forming labeled DNA sample fragments from a nucleic 
acid sample; 

(b) size separating said sample fragments with size 
standard fragments, and detecting the fragments to form a sample 
signal and a size standard signal; 

(c) transforming the sample signal into size coordinates 
using the size standard signal; and 

(d) analyzing the nucleic acid sample in size 
coordinates. 

2. A method for analyzing a nucleic acid sample 
comprised of the steps: 

(a) forming labeled DNA sample fragments from a nucleic 
acid sample; 

(b) size separating and detecting said sample fragments 
to form a sample signal; 

(c) forming labeled DNA ladder fragments corresponding 
to molecular lengths; 

(d) size separating and detecting said ladder fragments 
to form a ladder signal; 



(e) transforming the sample signal into length 
coordinates using the ladder signal; and 

(f) analyzing the nucleic acid sample signal in length 
coordinates. 

3. A method as described in Claim 2 wherein after the 
analyzing step (f) there is the additional step of determining a 
length or amount of a fragment in the nucleic acid sample. 

4. A method as described in Claim 3 wherein after the 
determining step there is the additional step of finding a gene by 
positional cloning. 

5. A method as described in Claim 3 wherein after the 
determining step there is the additional step of identifying an 
individual by DNA profiling. 

6. A method for generating revenue from computer scoring 
of genetic data comprised of the steps: 

(a) supplying a software program that automatically 
scores genetic data; 

(b) forming genetic data that can be scored by the 
software program; 

(c) scoring the genetic data using the software program 
to form a quantity of genetic data; and 

(d) generating a revenue from computer scoring of 
genetic data that is related to the quantity. 



7. A method as described in Claim 6 wherein prior to the 
step (d) of generating a revenue there are the steps of: 

(e) defining a labor cost of scoring the quantity of 
genetic data when not using the software program; 

(f) providing a calculating mechanism for estimating the 
labor cost from the quantity; 

(g) determining the labor cost based on the quantity; 
and 

(h) establishing a price for using the software program 
that is related to the labor cost. 

8. A method as described in Claim 7 wherein the 
calculating mechanism includes a spreadsheet. 

9. A method as described in Claim 7 wherein the 
calculating mechanism is provided via the Internet. 

10. A method as described in Claim 7 wherein the 
calculating mechanism operates interactively via the Internet. 

11. A system for analyzing a nucleic acid sample 
comprising: 

(a) means for forming labeled DNA sample fragments from 
a nucleic acid sample; 

(b) means for size separating and detecting said sample 
fragments to form a sample signal, said separating and detecting 
means in communication with the sample fragments; 

(c) means for forming labeled DNA ladder fragments 
corresponding to molecular lengths; 

(d) means for size separating and detecting said ladder 
fragments to form a ladder signal, said separating and detecting 
means in communication with the ladder fragments; 

(e) means for transforming the sample signal into length 
coordinates using the ladder signal, said transforming means in 
communication with the signals; and 

(f) means for analyzing the nucleic acid sample signal 
in length coordinates, said analyzing means in communication with 
the transforming means. 

12. A method for producing a nucleic acid analysis 
comprised of the steps: 

(a) analyzing a first nucleic acid sample on a first 
size separation instrument to form a first signal; 

(b) analyzing a second nucleic acid sample on a second 
size separation instrument to form a second signal; 

(c) comparing the first signal with the second signal in 
a computing device with memory to form a comparison; and 

(d) producing a nucleic acid analysis of the two samples 
from the comparison that is independent of the size separation 
instruments used. 



13. A method as described in Claim 12 wherein the size 
separation instrument is a DNA sequencer that uses 
electrophoresis. 



14. A method as described in Claim 12 wherein the 
nucleic acid analysis characterizes a size or amount of DNA in one 
of the nucleic acid samples. 

15. A method as described in Claim 12 wherein the 
nucleic acid analysis finds a gene by positional cloning. 

16. A method as described in Claim 12 wherein the 
nucleic acid analysis identifies an individual by DNA profiling. 

17. A method for resolving DNA mixtures comprised of the 
steps: 

(a) obtaining DNA profile data that include a mixed 
sample; 

(b) representing the data in a linear equation; 

(c) deriving a solution from the linear equation; and 

(d) resolving the DNA mixture from the solution. 

18. A method as described in Claim 17 wherein the 
obtaining step (a) includes the step of performing a PCR on an STR 
locus of an individual. 

19. A method as described in Claim 17 wherein the 
representing step (b) includes a matrix or vector representation. 



20. A method as described in Claim 17 wherein the 
deriving step (c) includes an optimization procedure. 






21. A method as described in Claim 17 wherein the 
deriving step (c) includes a matrix operation. 

22. A method as described in Claim 17 wherein the 
resolving step (d) produces the genotype of an individual. 



ABSTRACT OF THE DISCLOSURE 
A METHOD AND SYSTEM FOR DNA ANALYSIS 

The present invention pertains to a process for 
automatically analyzing nucleic acid samples. Specifically, the 
process comprises the steps of forming electrophoretic data of DNA 
samples with DNA ladders; comparing these data; transforming the 
coordinates of the DNA sample's data into DNA length coordinates; 
and analyzing the DNA sample in length coordinates. This analysis 
is useful for automating fragment analysis and quality assessment. 
The automation enables a business model based on usage, since it 
replaces (rather than assists) labor. This analysis also provides 
a mechanism whereby data generated on different instruments can be 
confidently compared. Genetic applications of this invention 
include gene discovery, genetic diagnosis, and drug discovery. 
Forensic applications include identifying people and their 
relatives, catching perpetrators, analyzing DNA mixtures, and 
exonerating innocent suspects. 



Figure 1. 



1 acquire data 

2 process signal 

3 separate colors 

4 remove primers 

5 track sizes 

6 extract profiles 



Figure 2. 




Figure 4. 



[Screen image: "Size Profiles" window plotting signal against Base Pairs (axis ticks 310, 320), annotated with the steps: acquire data, process signal, separate colors, remove primers, track sizes, extract profiles.] 


Figure 5. 



7 derive allelic ladder 

8 transform coordinates 

9 quantitate trace 

10 analyze data 



Figure 6. 



[Screen image: "Size Profiles" window for HA99-0004 D16 L18, plotting signal (vertical axis ticks 200 to 1800) against Base Pairs (220 to 280), annotated with the steps: derive allelic ladder, transform coordinates, quantitate trace, call alleles.] 



Figure 7. 



[Screen image: "Allele View" window for sgm+.gel : D2S1338, Quality 0.92511, marker D2S1338: 288-348bp tetranucleotide. Two panels, [DNA] and ALLELES, share a Base Pairs axis from 286 to 350. Readout: allele 324.3, ladder 324.3, label 324; final alleles 312 and 324.] 



Figure 8. 



sgm+.gel sample31 (L33): D2S1338 

GENOTYPES 

Label   Designation   AlleleSize   LadderSize   Deviation 
300     17            300.4        300.4        +0.0 
324     23            324.3        324.3        +0.0 

Score = 0.0000 

RULES FIRED 
number fired = 2 
> peak centering 
  The designated peaks should comprise a certain percentage of the signal. 
  This profile's designated peaks contain less than that percentage. 
> third peak 
  One or two peaks should contain most of the DNA signal. 
  In this genotype, a third peak contained too much DNA signal. 

Label   Size    Height   Area    PeakFit 
296     296.6    51.3     35.9   0.8165 
297              33.9     17.9   0.8014 
300     300.4   923.1    249.2   0.9249 
302     301.9   387.4    187.6   0.9355 
320     320.5    23.1     16.2   0.7864 
321     321.6    21.7      9.5   0.8252 
324     324.3   380.3    210.9   0.9328 
325     325.9   294.3    156.3   0.9036 


Figure 9. 



[Screen image: "Size Profiles" window plotting signal against Base Pairs (100 to 350), with the annotation: Query & present artifacts, e.g., "size std tracking".] 



Figure 10. 



102 nucleic acid sample 
104 forming means 
106 labeled DNA sample fragments 
108 size separating and detecting means 
110 sample signal 
112 forming means 
114 labeled DNA ladder fragments 
116 size separating and detecting means 
118 ladder signal 
120 transforming means 
122 sample signal in length coordinates 
124 analyzing means 



Figure 11. 

[Plot: "Differential Display", horizontal axis ticks from 50 to 500 in steps of 50.] 



Figure 12 



[Flow diagram with the stages TEST, PACKAGE, DISTRIBUTE, COLLATE and EDIT, and the activities: assemble PDF hypertext user manuals, assemble cross-platform software, quick check, e-distribution.] 

Figure 13 



[Spreadsheet "LaborCost"] 

PEOPLE COST      $1,024,000 
PER GENOTYPE     $1.02 

Breakdown        per person        Throughput      per day 
salary           $25,000 
benefits         $6,250            runs            8 
space            $2,000            genotypes       4,000      1,000,000 
computer         $2,000 
software         $10,000           Scoring 
management       $6,250            calls/person    500        125,000 
overhead         $12,500 
COST             $64,000           PEOPLE          16 

Assumptions                        Assumptions 
benefit rate     0.25              genotypes/run   500 
sq feet/person   100               days/year       250 
cost/sq foot/yr  $20               people/call     2 
managing rate    0.25 
overhead rate    0.50 
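
The figures in the LaborCost spreadsheet are mutually consistent, and the arithmetic can be reproduced with a short Python sketch. The formulas below are inferred from the numbers shown and are an assumption, since the spreadsheet's own cell formulas are not visible in the figure; the function and parameter names are likewise hypothetical.

```python
def labor_cost(genotypes_per_year,
               salary=25_000, benefit_rate=0.25,
               sq_feet_per_person=100, cost_per_sq_foot_year=20,
               computer=2_000, software=10_000,
               managing_rate=0.25, overhead_rate=0.50,
               calls_per_person_day=500, days_per_year=250, people_per_call=2):
    """Estimate manual scoring labor cost from the yearly genotype quantity."""
    cost_per_person = (salary
                       + salary * benefit_rate                       # benefits
                       + sq_feet_per_person * cost_per_sq_foot_year  # space
                       + computer + software
                       + salary * managing_rate                      # management
                       + salary * overhead_rate)                     # overhead
    calls_per_person_year = calls_per_person_day * days_per_year
    people = genotypes_per_year * people_per_call / calls_per_person_year
    people_cost = people * cost_per_person
    return people, cost_per_person, people_cost, people_cost / genotypes_per_year

people, per_person, total, per_genotype = labor_cost(1_000_000)
print(people, per_person, total, round(per_genotype, 2))
```

With 8 runs per day, 500 genotypes per run and 250 days per year, the yearly quantity is 1,000,000 genotypes, which is the input used above.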



Figure 14. 




Declaration and Power of Attorney For Patent Application 
English Language Declaration 

As a below named inventor, I hereby declare that: 

My residence, post office address and citizenship are as stated below next to my name. 

I believe I am the original, first and sole inventor (if only one name is listed below) or an original, 
first and joint inventor (if plural names are listed below) of the subject matter which is claimed and 
for which a patent is sought on the invention entitled 

A METHOD AND SYSTEM FOR DNA ANALYSIS 

the specification of which 

(check one) 

[X] is attached hereto. 

[ ] was filed on ____ as 

Application Serial No. ____ 

and was amended on ____ 

(if applicable) 

I hereby state that I have reviewed and understand the contents of the above identified specification, 
including the claims, as amended by any amendment referred to above. 

I acknowledge the duty to disclose information which is material to the examination of this application 
in accordance with Title 37, Code of Federal Regulations, §1.56(a). 

I hereby claim foreign priority benefits under Title 35, United States Code, §119 of any foreign 
application(s) for patent or inventor's certificate listed below and have also identified below any 
foreign application for patent or inventor's certificate having a filing date before that of the application 
on which priority is claimed: 

Prior Foreign Application(s)                                  Priority Claimed 

(Number)     (Country)     (Day/Month/Year Filed) 
(Number)     (Country)     (Day/Month/Year Filed) 
(Number)     (Country)     (Day/Month/Year Filed) 

I hereby claim the benefit under Title 35, United States Code, §120 of any United States application(s) 
listed below and, insofar as the subject matter of each of the claims of this application is not disclosed 
in the prior United States application in the manner provided by the first paragraph of Title 35, United 
States Code §112, I acknowledge the duty to disclose material information as defined in Title 37, 
Code of Federal Regulations, §1.56(a) which occurred between the filing date of the prior application 
and the national or PCT international filing date of this application: 

(Application Serial No.)     (Filing Date)     (Status) (patented, pending, abandoned) 
(Application Serial No.)     (Filing Date)     (Status) (patented, pending, abandoned) 



I hereby declare that all statements made herein of my own knowledge are true and that all 
statements made on information and belief are believed to be true; and further that these statements 
were made with the knowledge that willful false statements and the like so made are punishable 
by fine or imprisonment, or both, under Section 1001 of Title 18 of the United States Code and that 
such willful false statements may jeopardize the validity of the application or any patent issued 
thereon. 

POWER OF ATTORNEY: As a named inventor, I hereby appoint the following attorney(s) and/or 
agent(s) to prosecute this application and transact all business in the Patent and Trademark Office 
connected therewith. (list name and registration number) 

Ansel M. Schwartz, Reg. No. 30,587 

Send Correspondence to: 

Ansel M. Schwartz 

Direct Telephone Calls to: (name and telephone number) 

412/621-9222 

Full name of sole or first inventor 

Mark W. Perlin 

5904 Beacon Street, Pittsburgh, PA 15217 

Citizenship 

United States 

Post Office Address 

5904 Beacon Street, Pittsburgh, PA 15217 

Full name of second joint inventor, if any 

Second inventor's signature 

Citizenship 

Post Office Address 

(Supply similar information and signature for third and subsequent joint inventors.) 



